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Welcome ... 


to the third issue of the Journal for 2001. For this issue we have selected four papers on a wide 
variety of topics. Fei Li and Yan Lui discuss smoothing algorithms for MPEG video, Wei Li and 
Minjie Zhang look at methodologies for supporting mobile agents, Shihoon Cho, Joonwon Lee and 
Joongsoo Ma propose a new extension to flash memory file systems and Faye Borthick, Paul 
Bowen, Fiona Rohde and Meng Liu report on an experiment designed to test whether database 
normalisation affects the way in which business users detect anomalies in data. 


Short Communications 

Apart from the full papers, this issue of the Journal contains the first of what I hope will be many 
Short Communications. These short papers will generally be comments and insights into topical 
issues or short or interim results that will be of wide interest. They are limited to four pages in length 
and in order to ensure that they are turned around quickly, are reviewed rather than fully refereed. 
In this issue John Lamp discusses the Code Red virus and the “epidemic” experienced in July and 
August this year. 


New Books Editor 

I am also pleased to welcome Dr Michael Oudshoorn from Adelaide University to the editorial team 
as our new Books Editor. Michael succeeds David Williams who, after many years as books editor, 
has embarked on some exciting and very different activities and therefore decided that his term as 
books editor had to end. We are extremely grateful to him for his work with the Journal over the 
years. All book reviews should be sent to Michael at michael @cs.adelaide.edu.au. 


A Third of a Century of Publication 
The next issue, the last issue for Volume 33, represents a milestone in the Journal’s history — one 
third of a century of publication. This makes the Journal one of the oldest continuously published 
computing journals in the world. As there will not be an editorial from me in the next issue, (it is 
also a special issue for other reasons), I would therefore like to pay tribute to the editors, associate 
editors and referees over this long period. Particular thanks should be extended to the ever-flexible 
Keith Collins and his team at Associated Business Publications, the production managers of the 
Journal since its inception in 1967. Finally, thanks should go to the Australian Computer Society for 
its unwavering support of the Journal over the years, particularly to Tom Worthington as 
Publications Board Director. 

The next issue will contain a printed list of all papers published over its 34 year history; this list 
can also be found at the Journal’s website — http://www. jrpit.acs.org.au/. 


John Roddick, FACS 
Flinders University 


 — 
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On Smoothing Algorithms for Transmission 
of Stored MPEG Video 
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In this paper, we consider smoothing algorithms in delivering stored varied-bit-rate 
(VBR) MPEG video over high-speed network. The required network bandwidth of a 
video service significantly affects its cost. Video smoothing techniques promise 
reduced bandwidth variablity and consequently reduced bandwidth requirement. 
Because of varying frame size, MPEG video introduces burstiness during 
transmission, when left unalleviated, will affect network management and resource 
utilisation. Various smoothing algorithms have been proposed to tackle bursty 
transmissions, giving optimal transmission schedules under various constraints. The 
first part of the paper gives a survey of the state-of-art smoothing algorithms for 
stored VBR videos. In the second part of the paper our work on satisfying each video 
bandwidth request under limited available bandwidth is presented. 

Keywords: Smoothing algorithm, statistical multiplexing, temporal burstiness, 
Video-on-Demand 


1. INTRODUCTION 

A broad range of applications, such as Video-on-Demand (VOD), digital TV/HDTV broadcasting, 
and multimedia image/video database service, requires the playback of stored videos over high- 
speed network. Usually these media objects must be displayed on the client side at a specific time, 
space, and rate. For coherent reception and continuous playback at the client, strict quality of 
service (QoS) must be provided in an end-to-end manner. Currently, most videos are encoded using 
the MPEG standard (Sikora, 1997). Although encoding in MPEG reduces storage space and 
amortised transmission bandwidth requirement, it induces bit-rate variations due to variable frame 
size, non-casual decoding sequence and inter-frame dependency. Significant rate variability 
complicates the design of efficient real-time storage, retrieval and transport mechanisms while 
achieving high resource utilisation. In light of the general real-time design principle “less bursty 
(i.e. smoother) workloads are easier to handle.” (Salehi et al, 1996), various smoothing algorithms 
have been proposed, and fall in the following three categories. 


1.1 Temporal Smoothing (Work-Ahead Approach): 
Work-ahead algorithms try to smooth out the source bandwidth requirement through intelligent 
scheduling. Based on the knowledge of the frame size and the sequence, as well as buffer size on 
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the client side, the server prefetches video frames and heuristically deliver them to the client buffer 
before playback to reduce potential future burstiness. These smoothing algorithms are commonly 
evaluated on metrics such as, the number of bandwidth increases, total number of bandwidth 
changes, variability of bandwidth and peak bandwidths required (Feng and Rexford, 1997), etc. For 
certain situations, they provide the optimal solutions. 


1.2 Spatial Aggregation (Statistical Multiplexing): 

Statistical multiplexing 1s a spatial aggregation mechanism in which several individual streams are 
asynchronously superimposed and transported over the same channel (Krunz, 1999). The resulting 
aggregate traffic exhibits smoother bit rate behavior (i.e., less variability) than individual streams. 
Therefore, the bandwidth is allocated to the aggregate traffic, which may be much less than the sum 
of the per-stream allocated bandwidths. There are three different approaches with statistical 
guarantees, they differ in the type of source model under which the statistical multiplexing gain is 
evaluated. Another guarantees, determined guarantee, can be provided based on the traffic stamp 
proposed in (Krunz and Tripathi, 1997). 


1.3 Temporal Multiplexing 

Temporal multiplexing is achieved by introducing one or more per-stream buffers into the switches 
along the end-to-end path. We can see that this scheme is paramount to increasing the buffer 
capacity of the client. Such a scheme can reduce the stream’s peak rate, but it is at the cost of 
introducing more delays before playback. In temporal multiplexing the server is not involved in the 
scheduling, and thus is not covered in this paper. 

Although smoothing algorithm may reduce bandwidth requirement significantly, there are still 
moments when available bandwidth is insufficient to transmit all the videos without loss of quality. 
Various algorithms have been proposed to determine the tradeoffs between bandwidth utilisation 
and quality of picture, both before and after video encoding. Those algorithms concerning 
video encoding before transmission will perform frame dropping, encoding frames hierarchically 
etc., to reduce bit rate. They introduce distortion at different levels. Algorithms concerning stages 
after encoding will use rate shaping, delay introduction, etc. Rate shaping is applied during transmis- 
sion when conflict happens and delay is introduced either at the beginning of playback or in the 
conflict interval. 

The rest of this paper is organised as follows. In Section 2 and Section 3 give overviews of the 
organisation of MPEG video streams and a stored video transmitting system respectively. Section 4 
presents state-of-the-art smoothing algorithms for stored VBR video streams in, focusing on 
temporal and spatial smoothing. In Section 5, we discuss some work concerning smoothing under 
limited bandwidth in. In Section 6, we conclude with an introduction to other works on smoothing 
algorithm and future research directions. 


2. ORGANISATION OF MPEG VIDEO STREAMS 

MPEG is a standard for compressing video and audio files. It addresses the compression, 
decompression and synchronisation of video and audio data. In MPEG, video and audio are 
compressed individually, and synchronisation is achieved through time stamps. Since video 
sequences contain a significant amount of statistical and subjective redundancy within and between 
frames, MPEG employs digital video-coding techniques to reduce such redundancy (Sikora, 1997). 
In this paper, we deal only with video compression. 
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MPEG uses a hierarchy of data structures in video encoding. Each video sequence begins with 
a sequence head and ends with an end-of-sequence code. Each sequence is divided into several 
groups of pictures (GOPs). Each GOP is a series of one or more pictures. A picture is the basic 
encoding unit. Each picture is made of slices. In general, a slice has a few macro-blocks and each 
macro-block has four blocks (MPEG). Figure 1 illustrates MPEG data organisation. 

MPEG relies on two techniques to reduce spatial and temporal redundancies - Discrete Cosine 
Transformation (DCT) and Block-Based Motion Estimation. DCT coding techniques are employed 
on image blocks of 8 x 8 pixels to efficiently explore spatial relations between nearby pixels within 
the same image. 





-_—_____________-Video Sequence 


=a Group of Picture —+ 


Picture 


8 Pixels 


Figure 1: MPEG data hierarchy 


The DCT algorithm transforms the 8 x 8 pixels from the spatial domain to the frequency domain. 
In the frequency domain, the 64 DCT coefficients are quantilised in a zigzag order to reduce bit rate. 
To achieve inter-frame redundancy, motion estimation technique computes motion vectors for every 
16 x 16 macro-block. The motion vectors of a frame are predicted from its past and/or future 
reference frames over a search window. 

In MPEG there are three types of pictures: /ntra frame (I-frame), Predicted frame (P-frame) and 
Bi-directional frame (B-frame). J-frames are coded using information present only in the picture 
itself [MPEG]. P-frames are coded with respect to a past /-frame/P-frame and will be used as a 
reference for future predicted frames. B-frames are coded with reference to a past frame and a future 
frame and they can provide the highest degree of compression. P-frames and B-frames consist of 
motion vectors and/or DCT coefficients. These frames are organised into GOPs (group of picture). 
A GOP is led by an /-frame followed by B-frames and P-frames (e.g., JBBPBBPBBPBBPBB). 
Figure 2 illustrates MPEG encoding order. 

There are two important facts concerning the MPEG compression standard. One is the generic 
structure of the MPEG standards. Users can specify the picture size and frame rate to support a wide 
range of application profiles. The other is that the MPEG group only standardises the decoder 
structure and the bit-stream formats. This allows a large degree of freedom for manufacturers to 
optimise coding efficiency by developing innovative encoding algorithms even after the standards 
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are finalised. These two properties are described by two characteristic parameters: the inter-frame 
to intra-frame ratio, N, and the quantiser scale parameter, g. Pancha and Zarki (1993) have shown 
these properties in the communication environment. They investigated ways in which a 
combination of varying bandwidth allocation and layered coding can be used to obtain better 
performance. Under limited bandwidth, a choice of these parameters is a trade-off between 
bandwidth utilisation and quality of picture, and results in lossy compression. 


Group of Picture 


B Frame a P Frame 


Figure 2: The arrangement of the picture coding types within the video sequence 





In the subsequent sections, we use a discrete-time model ¢t €{1, 2,..., nm} to characterise the 
transmission period of a MPEG video. A MPEG stream 5; is assumed to consist of n; frames, where 
frame j requires f, bytes of storage. Tao Pinav Binax denote the maximum sizes of J, P, B frames in 
s, respectively. Normally, 


ax?’ 


holds in most cases. In addition, we define L' and Q! which describe the GOP pattern of s;, where, 
L' is the number of frames between two consecutive /-frames in an MPEG stream; Q' is the number 
of frames between an /-frame and subsequent /-frame or P-frame . 


3. SYSTEM STRUCTURE FOR TRANSMITTING STORED VIDEO: AN OVERVIEW 

In a generic multimedia tele-presentation system, there will be one or more media servers, a 
communication network and one or more clients. A set of media objects are either generated at the 
server from live sources, or pre-recorded in files, or a mix of both. Here, we introduce the system 
for stored video. 

Figure 3 depicts the system structure for transmitting stored video. Video is stored at the server, 
and delivered to the client through network upon request. The server is responsible for the 
scheduling of retrieval and transmission operations. For stored video, the server has the knowledge 
of the frame sizes of each video as well as client buffer size. 
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Figure 3: The basic system for the transmission of stored video. 


At connection setup time, a certain amount of bandwidth is reserved. The server dynamically 
allocates bandwidth according to the smoothed profile of the video. Smoothing buffers are at the 
client sides. Frames arriving early are stored in the buffer and played back later. This duration is 
called start-up delay, which is to reduce bandwidth requirement for the initial few frames. The 
transmission rate through the network channel must be adjusted such that the buffer will never 
overflow or underflow. The system model used in this paper is a server transmits multiple, distinct 
VBR streams to clients across a high-speed network (Jiang and Kleinrock, 1998). 

Under constant-bit-rate (CBR) network service, every video stream is allocated a fixed 
bandwidth, normally its peak bandwidth, for the duration of the entire transmission. This leads to a 
simple static scheduling algorithm for the video server. However, this results in the waste of 
bandwidth since the average playback rate is normally much lower than the peak rate used in 
reserving the bandwidth. Furthermore, such approach will cause client buffer overflow especially 
during the period of low playback rate. To avoid low network utilisation, Chen et al (1995) 
proposed a scheme that allocates a fixed bandwidth that is lower than the peak rate. As a result, 
during the transmission, this approach selectively discards some insignificant frames whenever the 
bandwidth cannot support the playback rate. Contrary to static scheduling approaches, piecewise- 
CBR and VBR network services offer the possibility of dynamically scheduling the transmission 
rate for bandwidth smoothing. CBR simplifies resource allocation and admission control 
procedures, while piecewise-CBR and VBR utilise resources more efficiently. To save resources 
and ease management, smoothing algorithms are employed in piecewise-CBR and VBR service 
networks. Two main smoothing approaches that do not introduce delay are work-ahead approach 
and statistical multiplexing. 


4. SMOOTHING ALGORITHMS FOR STORED VIDEO STREAM 

Smoothing algorithms can be either work-ahead approach or statistical multiplexing. The work- 
ahead approach is based on a stream-by-stream principle while statistical multiplexing is based on 
a multiple streams principle. For the work-ahead approach, smoothing algorithms have different 
optimisation objectives. For example, if the same bandwidth is reserved for the whole transmission, 
they may want to minimise the peak transmission rate or the rate variability of the smoothed video 
(Feng, Jahanian and Sechrest, 1997; Salchi et al, 1996). In a system where bandwidth can be 
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negotiated during the session, it is usually desirable to minimise the number of rate changes (Feng 
and Sechrest, 1995). For a video on demand system, it is important to support the VCR-function 
such as rewind, fast-forward, fast-backward etc., which can be implemented more efficiently if the 
requirements on the smoothing buffer and the buffer occupancy are minimised. To smooth 
burstiness, the server allocates a different amount of bandwidth at different time intervals for the 
requested video. The common feature of these algorithms is that: at each run (interval), the 
transmission rate can last without causing either buffer overflow or underflow. They differ in how 
to allocate the starting point for the next run after the rate for the previous run has been determined. 
Here, we give a brief review of the existing algorithms under certain constraints. These algorithms 
are applied to stored VBR videos. 


A. Work-ahead Approach 

A work-ahead approach performs temporal bandwidth smoothing on a stream-by-stream basis, 
assuming that the server has exact knowledge of the client buffer capabilities and the video 
sequence. 

Consider a discrete-time model at the frame level. Suppose the video has n frames, the size of 
the smoothing buffer is b. To guarantee continuous playback of a MPEG stream, a server must send 
at least F’,,¢e(t) cumulative bytes over [1, ¢] interval to avoid client buffer underflow (Feng and 
Rexford, 1997). 


© sve (t) - fi 


Meanwhile, the client buffer can receive at most F,,,,,(t) cumulative bytes over [1, ¢] without buffer 
overflow, 


Fover (t) = b+ VF; 


t 
and F’,,.,(1) = b. Consequently, a feasible transmission schedule S(t)= pa c; must satisfy 


U gailep (t) = S(t) = Foyer (t) 


where c; is the cumulative data transmitted by the server over [i - 1, i]. The area bounded by F.,,,,, 
and F’,,,7er 18 a feasible area to schedule the transmission. Any feasible schedule, as shown in Figure 
4, in this area will not cause buffer underflow and overflow problems. 

Obviously, there exists a set of feasible schedules, I’, for the transmission of the MPEG streams. 
To video server researchers, the problem is how to find a feasible schedule, S* € T,which is as 
smooth as possible. The peak rate and variance of a schedule S are defined as, 


peak(S) = max f, 


var(S) =a 
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Figure 4: Transmission Schedule 
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Figure 5: Allocation new bandwidth 


where f = x _ J‘. An optimal schedule S* should have minimum peak rate and variation, i.e., peak(S") 
< peak(S) and var(S") < var(S) for any schedule S. Notice that a CBR-schedule S = [f, .. . , f] is optimally 
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smooth, however, the CBR schedule in Figure 4 is infeasible, and suffers from buffer underflow during 
certain periods. 

Creating a feasible schedule, in general, involves creating m consecutive runs. In Figure 4, the 
feasible schedule has m = 4. These runs join together to form a monotonically increasing piecewise- 
linear path; each run r; will allocate “| bandwidth for a duration. The smoothing algorithms select 
the starting point for run 7; + , based on the trajectory of run r;. For instance, in Figure 4, if r, does 
not have its rate decreased, the schedule will cause buffer overflow. Therefore, the starting point of 
r; is selected with F< fh | 

Figure 5 illustrates the examples. Basically, there are two selection schemes, 

1. Locate a starting pointing for 7; + ; as early as possible before the trajectory of 1; intersects F,,,5; 
or Finder (Salehi et al, 1996). | 
2. Search along the trajectory of r; to locate the starting point that allow r; + ; to extend as long as 


possible (Feng, Jahanian and Sechrest, 1997) 

The former approach minimises the variability of bandwidth and peak rate with O(n’) running 
time; while the latter approach minimises the number of bandwidth changes, m, with O(n’logn) run 
time. 

Feng and Rexford (1997) compared the performance of several smoothing algorithms: critical 
bandwidth allocation (CBA) (Feng and Sechrest, 1995), minimum change bandwidth allocation 
(MCBA) (Feng, Jahanian and Sechrest, 1997), and minimum variability bandwidth allocation 
(MVBA) (Salehi et al, 1996). For any rate changes, the CBA algorithm starts a rate decrease at the 
longest trajectory for run j+1, based on the selected starting point and initial buffer occupancy, and 
chooses the starting point as early as possible when it is to increase bandwidth. This results in a 
transmission plan that has the smallest possible peak bandwidth requirement and the minimum 
number of bandwidth increases. MCBA extends the CBA scheme that choose the trajectory point as 
far as possible regardless of increasing or decreasing bandwidth requirement. This results in a 
transmission plan with the smallest possible number of rate changes and minimum peak bandwidth 
requirement. MCBA minimises m resulting in saving of the cost of renegotiating bandwidth 
requirement across the network, but the saving is at the expense of increased size of each request. 
Salehi et al (1996) have proved that MVBA can schedule an optimal transmission plan, which 
minimises the peak rate requirement and variance of bandwidth. As a result, MVBA is able to 
gradually alter the bandwidth requirement. In practice, nevertheless, the work-ahead approach 1s 
limited by the requirement of real-time responses to the changes of a client buffer, and to the 
interactive VCR functions. MVBA approach minimises the variability of bandwidth and peak rate 
with O(n) running time; while CBA and MCBA approaches minimise the number of bandwidth 
changes, i.e., minimise m, and the peak bandwidth requirement with O(n’logn). 

In contrast to CBA, MCBA and MVBA, the piecewise constant rate transmission and transport 
(PCRTT) algorithm creates bandwidth allocation plans by dividing the video stream into fixed-size 
intervals. This O(n) algorithm generates a single run for each interval by connecting the intersection 
points on the F,,,7e, Curve, as shown in Figure 6. The slopes of these lines correspond to the rates r; 
in the resulting transmission plan. To avoid buffer underflow, the PCRTT scheme vertically offsets 
this plan until all of the runs lie above the F,,,4¢- curve. Raising the plan is equivalent to introducing 
an initial playback delay at the client; the resulting transmission curve also determines the minimum 
acceptable buffer size to avoid overflow given the interval size, as shown in Figure 6. Because the 
PCRTT algorithm determines the buffer size from the bandwidth plan, it can be difficult to calculate 
an appropriate interval size given a fixed buffer b (McManus and Ross, 1996). An extension to the 
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PCRTT algorithm (PCRTT-DP) employs dynamic programming to calculate a minimum-cost 
transmission plan that consists of m runs. Focusing on the buffer size b as the cost metric, PCRTT- 
DP iteratively computes the minimum-cost schedule with & runs by adding a single rate change to 
the best schedule with k — 1 rate changes. This algorithm has a time complexity of O(n’). 
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used plan 
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Figure 6: PCRTT plan creation 


Table I summaries these five algorithms under several metrics. For more information about the 
metrics and their constraints on applications, please refer to Feng (1997). In this table, ““Yes” means 
the algorithm optimises the corresponding metric, “No” means otherwise. 


Algorithm Peak Rate Variability in Number of Periodicity Buffer Running Time 
Bandwidth Bandwidth Utilisation (Worst Case) 
Changes 


MVBA Near 50% O(n) 

MCBA Near 50% O(n2logn) 
CBA Lowest O(n2logn) 

PCRTT No Highest O(n) 


PCRTT-DP Near MCBA Near 50% O(n?) 





Table I: Performance of five algorithms in spatial smoothing 


B. Statistical Multiplexing 
Statistical multiplexing (SM) is a mechanism to reduce the bandwidth requirement of bursty and 
VBR traffic sources (Krunz, 1999). It is being used in ATM networks in conjunction with QoS 
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guarantees. In essence, statistical multiplexing is a spatial aggregation mechanism in which several 
individual streams are asynchronously superimposed and transported over the same channel. The 
resulting aggregate traffic exhibits smoother bit rate behaviour, 1.e., less variability, than the original 
streams. Bandwidth is allocated to the aggregate traffic, resulting in a reduction in the per-stream 
allocated bandwidth. An example of statistical multiplexing mechanism is shown in Figure 7. 


Public network Head-end 
| switch 


Video e a 


stream 


Video —~> 


stream ~ 7 _ Clients 





Figure 7. Example of statistical multiplexing at a video server 


Traditionally, SM is used inside network switches and routers. It can also be employed at end 
system. For example, a remote video server may use SM to maximise the number of simultaneous 
video connections transmitted from the server to a head-end (HE) switch over a fixed-bandwidth 
pipe, see Figure 7. In this scenario, multiplexing is performed among streams that are destined to 
the same head-end switch, which in turn demultiplexes the traffic into several streams. 

Other scenarios for SM are possible. For example, the distribution network may consist of 
several local servers in addition to a central remote server that maintains a repository of compressed 
videos. A subset of these videos is periodically multicast to the local servers. Each local server is 
connected to a head-end switch that serves a pool of clients. In this scenario, SM can be performed 
over the high-speed link connecting the remote server to each local server, and also over the link 
between a local server and the associated HE switch (Krunz, 1999). 

Two guarantees called statistical guarantee and deterministic guarantee are provided in the 
statistical multiplexing mechanism. 

Three different approaches can be used to provide statistical guarantees through statistical 
multiplexing. They differ in source models used in evaluating statistical multiplexing gain. 

1. The first and most common model one uses is the stochastic source models. This model is useful 
in assessing the transport performance for a video whose traffic profile is not known at the time of 
admission control. Based on the queuing analysis of a video traffic model, the server determines the 
minimum amounts of bandwidth and buffer needed to guarantee a certain level of QoS. Bandwidth 
gain is attained by allowing the sum of the peak rates of the input streams to exceed the service rate 
of the multiplexer. This model can be classified according to the persistence of their auto-correlation 
structure into four models. They are: renewal model (no correlations), Markovian and 
autoregressive models (exponentially decaying autocorrelation), sub-exponential models (the decay 
of the auto-correlation is slower than exponential but faster than a power function) and long range 
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dependent (LRD) models (autocorrelation decays as a power function). Interested readers can 
explore more in Izauierdo and Reeves (1999). 

2. Instead of detailed stochastic models, video sources can be specified by stochastic bounds. For 
example, if X(t) is the number of cell arrivals in any interval t, a stochastic bound could be given by 


Pr[X(t) > @] = 


for several values of t, 0, and é€,. End-to-end guarantees are obtained by first obtaining stochastic 
bounds on the traffic at the edge of the network, which are then used to limit the ongoing traffic 
from that node. In turn, the bounded departure traffic of one node is used to bound the arrival traffic 
at the next node, and the procedure is repeated for all nodes along the path. A drawback of this 
approach is that the bounds become loose as more nodes are traversed. 
3. In contrast to the above two approaches, a video source can be characterised by deterministic 
time-invariant traffic envelope. When several of these traffic envelopes are statistically multiplexed, 
the multiplexing gain depends on the relative phase shifts of these envelopes. Since the envelopes 
are time-invariant, their phase shifts can be assumed to be random, giving rise to statistical QoS 
guarantees. 

Statistical multiplexing with deterministic guarantees was proposed by Krunz and Tripathi 
(1997). They used traffic-envelope model to smooth the delivery of VBR MPEG streams. The 
traffic-envelope of a stream s, is a time varying periodic function b(t) characterised by 


E'= (Fnav Prov Bmax E's Q') 


where the first three parameters denote the maximum size of /-frame, B-frame, P-frame in s,, and 
the last two parameters describe the GOP pattern of s; (Section 2). Under the above traffic envelop 
characterisation, MPEG sources can be statistically multiplexed at the video server and transported 
to a head-end switch with minimal bounded delay and no losses. The server maintains an active 
table of the traffic envelopes of all ongoing videos. These envelopes are updated when a new stream 
is added or an ongoing one is terminated. U= [u,, u,... Uy] is described as the schedule start-up times 
of multiple streams at the network switch. The effective bandwidth of N streams is defined as Ng 
and Song (1997), 


Cay max{D b(t u)} 


where ¢ — u; is an index to the traffic envelope of stream s; and u = [uy, U,..., uy] describes the 
schedule of multiple streams at the network switch. Suppose all the streams are homogeneous 
(L; = L for all 7) and all traffic envelopes are the same. Then 


C= ( * Ina +n* Pua + k* Bung} 


where m, n, k are the number of frames belongs to /-frame, P-frame, and B-frame i alae and 
m+n+k=N. If Equation (1) holds, then 


C=. 
which means that the effective bandwidth is often less than the peak rate. Krunz and Tripathi (1997) 
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further proved that there exists an optimal scheduling u* which gives minimal C* if all streams are 
homogeneous. 
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Figure 8. Traffic envelope with L=6 and Q=3 
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Figure 9. An example of statistical multiplexing of two MPEG sources 


C. Hybrid Approach 

In general, integrating work-ahead approach and statistical multiplexing can offer a certain degree 
of improvement in performance. In Zhang et al (1997), investigated the statistical multiplexing gain 
of a work-ahead schedule. They first conducted experiments to study the effect of client buffer size 
on the work-ahead schedule. As expected, larger buffer size produces smoother schedule. Then, 
they investigated the impact of a work-ahead schedule on auto-co-relation structure. A smoother 
work-ahead schedule has higher auto-co-relation segments (runs). Empirical results indicate that 
potential statistical multiplexing gain may diminish if the video streams are high-co-related. This 
implies: the hybrid approach is more suitable for clients with smaller buffer size, the statistical 
multiplexing gain diminishes if most of the clients display the same movies during a certain period. 


5. TRANSMISSION UNDER LIMITED BANDWIDTH 

In the realm of video-on-demand, video usually has to be transmitted under limited bandwidth. 
Methods employed can be classified using delay and distortion they introduce. In general, a 
compromise between quality of picture and delay is needed to satisfy the limitation of bandwidth. 
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A. Without Delay or Distortion 

Feng (1997) proposed rate-constrained bandwidth smoothing (RCBS) to deal with rate constraint. 
It is for VCR functions for interactive video-on-demand system. Given a maximum rate constraint 
r, any frame that exceeds the rate constraint is modified to the maximum rate constraint and then 
pre-fetched. To minimise the peak rate requirement, as well as to maximise buffer utilisation, RCBS 
first uses O(n) MVBA to determine the smallest possible peak rate for a given buffer size. Then, 
this rate constraint is used in computing the RCBS schedule. Given the fixed maximum bandwidth 
rate, this algorithm minimises the smoothing buffer residency times, the clients and servers can 
remain more tightly coupled making VCR functions easier to support. 

In general, the server transmits requested video according to pre-computed smoothed profile. 
Under the circumstances that a conflict happens between the bandwidth profile and the smoothed 
profile, the server has to re-compute the smoothed profile, which will incur a lot of real-time 
computation. In Liu, Lee and Li (1999), presented a method to test conflict intervals and update the 
original smoothed profile. Instead of running smoothing algorithm, smoothed profile is updated 
using the current bandwidth profile at the conflict intervals. The limitation of this method is that it 
is only applied when bandwidth is sufficient for strict playback but conflict happens between 
smoothed profile and current bandwidth profile. 


B. Without Delay and With Distortion 

Transmission without delay can be achieved through introducing distortion. Three methods are used 
in introducing the delays. The first one is to transmit a subset of the streams if it has been scalably 
compressed. This is based on the characteristic of MPEG compression method. The second is to 
delete some less important frames, which can be achieved by subjective redundancy. Ng and Song 
(1997), modified the algorithms proposed in Salehi et al (1996) and proposed two schemes in 
dealing with transmission under limited bandwidth. One of the two schemes is to delete some 
frames. For the smoothing of MPEG-1 videos, they prefer to delete B-frames based on previous 
experiments that show subjective effects without loss of quality. If the bandwidth required still 
exceeds the bandwidth available, they will delete PF also. The third scheme is to insert a rate shaper 
between the encoder and the communication network. The rate shaper re-quantise bit rate 
dynamically according to the bandwidth constraints. This work is to find the optimal breakpoint 
during the zigzag quantisation. Methods for finding the optimal breakpoint can be found in 
Eleftheriadis and Anastassiou (1995). 

Tse and Liew (1995) considered the degradation of image quality when the network can only 
support an aggregated bandwidth A which is less than the total requested bandwidth. Let A be the 
number of bits in a video stream s; that must be retained in order to maintain an image distortion 
level D, they optimise A by distorting all images to the same degree. Namely, 


YA (D5=A 
i=1 


such that all s; are guaranteed to have the same distortion D’ as a result of insufficient bandwidth. 
Their proposed approach progressively discards insignificant DCT coefficients while aggregately 
multiplexing video streams until equality of Equation 12 is reached. In practice, it may not be 
possible to achieve absolute equality of distortion level because of the discrete nature of bits or 
groups of bits that are dropped. 
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C. With Delay and Without Distortion 

Delay can be introduced at the beginning or at the conflict interval during transmission to preserve 
quality under limited bandwidth. Li and Liu (1999) proposed a fast algorithm in deciding minimal 
delay at the beginning of playback to satisfy limited bandwidth without loss of quality. They chose 
safe intervals for the runs of the smoothed profile of the requested video. The intersection of all safe 
intervals makes the safe interval for the transmission. The delay chosen is the smallest value in the 
intersection, which returns the shortest delay. 

The delay can also be introduced at the conflict intervals. In Salehi et al (1996), only when a 
conflict happens, are streams paused. 

Case 1: If the last segment transmits at the highest feasible constant rate, the client plays back 
continuously until the buffer is empty. Then client pauses for a negotiated time and starts the next 
segment. 

Case 2: If the last segment transmit at the lowest feasible constant rate, the client pauses for a 
negotiated time and then starts requesting the next segment. 

Jiang and Kleinrock (1998) provides a promising work. A model with parameters describing the 
performance of the video service is provided. At first, the performance is represented as nodes with 
each node denoting the performance of an interval during transmission. Then, with the constraints 
and changes of the environment, the server will search among the nodes for the optimal 
performance. That is, the server tries to minimise the cost in transition. Costs are tagged on the arcs 
from one node to another. Thus, finding the transition between points is converted to finding the 
shortest path. Jiang and Kleinrock (1998) used dynamic programming to find the best way through 
the points. But since the number of nodes are limited, they may not be the optimal all the time. The 
number of nodes needed in the describing system depends on the constraints of the system. 


6. CONCLUSION 

In current video applications, more smoothing needs to be done to satisfy different constraints. For 
Video-on-Demand systems, we have to consider operations such as forward, backward, resume and 
stop during strict playback. These operations require smoothing with the least buffer utilisation. 

Based on practical considerations of heterogeneous and dynamically variable client buffer sizes, 
variable worst-case network jitter estimates, and client interactivity, Rao and Raghavan (1999) 
presented an algorithm for pre-computing and storing the optimal schedule for all possible client 
buffer sizes to satisfy an on-line computation of the schedule of transfer. Many other works are done 
in smoothing stored video steams, such as Koprulu, Meliksetian and Chen (1997), which uses 
thresholds to indicate the current occupancy of the buffer and decide whether to increase or decrease 
the bandwidth. 

Many of the studies reviewed here have shown that the schedule must provide an intelligent 
algorithm to satisfy requests at some QoS level, especially adapting the changing environment 
constraints. Till now, most studies have been focusing on the server side. A few topics for future 
research are, 

1. The work-ahead approach assumes that the server has the prior knowledge of the client buffer 
size. But it is not always true in practice since the buffer available for video applications varies. 
Thus, how to adjust the schedule at the client’s side in order to get better performance? 

2. Statistical multiplexing is used in many environments. However, under limited bandwidth, it is 


not preferred, as it cannot reduce bandwidth requirement. Therefore, how to adapt the residual 


bandwidth to save transmission time and resources? 
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3. All the algorithms target at finding the best tradeoffs between QoS and utilisation of resources. 
If hierarchical structure is used, how to make good scheduling between these QoS level? 

4. Some researchers attempt to predict the bandwidth requirement to smooth video. Based on the 
current and past display and video size, the server predicts the upcoming traffic and adjusts the 
bandwidth requirement, however, this requires more statistical modelling. As a result, modifying the 
concept of predicting bandwidth requirement to adapt to various environments is a difficult issue. 
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1 INTRODUCTION 


It is believed that a new star in the sky of distributed object-oriented systems is the emerging field 
of mobile agents (StraBer, Baumann and Hohl, 1997). There are many mobile agent systems (Magic, 
1998; Gray, 1996; Lange and Oshima, 1998; Object Space) that have been developed since the first 
prototype, Telescript (White, 1997), was proposed for e-commerce in 1994. There are many 
ongoing discussions about developments of new mobile agent systems, but few have focused on 
both mobility and autonomy (Bic, Fukuda and Dillencount, 1996; Cai, Gloor and Nog, 1996). In an 
application of mobile computation (Gray, et al, 1997; Kotz, et al, 1997), a user often launches a 
mobile agent from a laptop that is connected to the Internet, then the user disconnects the laptop 
from the Internet. The mobile agent travels in the Internet autonomously, retrieving and updating 
information locally on behalf of its owner. Later, the mobile agent will return to the user’s laptop 
and report the results when the user’s laptop is reconnected to the Internet. Mobile agents should 
have the “intelligence” to perform self-contained navigation and computation, giving mobile agents 
the powers to adapt to dynamic and heterogeneous networks. This is necessary because in most 
cases mobile agents cannot interact with their owners. However, it is not easy to define what 
autonomy is and how to provide mobile agents with autonomies. 

This paper focuses on providing agents with autonomies in the context of mobility. In this paper, 
new concepts, paradigms, and models are proposed and studied as the bases of Mobile Agent 
Template (MAT) and an agent structure, task description and hosting environment are integral parts 
to support the autonomies of mobile agents. The contributions of this research are that we adopt the 
workflow as agents’ working mode; we use goal-directed and dynamic task planning to deal with 
the heterogeneity and dynamism of networks; and we construct a virtually platform-independent 
environment to provide mobile agents with asynchronous, anonymous and fully localised 
interactions. 

This paper is organised as follows: In Section 2, three kinds of autonomies are defined for 
clarifying the meaning of autonomy in the context of mobility. In Section 3, the autonomous 
methodologies, which are Distributed Task Plan (DTP), Active State Space (ASS) and mechanisms 
for fully localised interactions, are proposed for designing the architecture of MAT. In Section 4, 
the implementation mechanisms of MAT, including autonomous primitives, DTP planning, and 
reflections of ASS are addressed. Finally in Section 5, this paper is concluded and directions of 
future research are given. 


2 WHAT ARE AUTONOMIES OF MOBILE AGENTS? 

Autonomy is a difficult concept to pin down precisely, but “it means that the system should be able 
to act without the direct intervention of humans (or other agents), and that it should have control 
over its own actions and internal state” (Jennings, Sycara, and Wooldridge, 1998). In the context of 
mobile agents, it is even more difficult to define autonomy. Having investigated some Web 
applications such as Internet information retrieval and mobile computation for which mobile agents 
are suitable (Ciancarini, Tolksdorf and Vitali, 1998; Rus, Gray and Kotz, 1997), we identify the 
necessary features that an autonomous mobile agent should have. By autonomy, a mobile agent is 
self-contained in navigation, computation and communication when transporting through the 
underlying heterogeneous and dynamic networks. An autonomous mobile agent should be aware of 
where to go to find messages or resources, what actions to perform for getting information, and how 
to deal with exceptions when its performances fail. In the context of autonomy, it should be enough 
for mobile agents to interact with communication partners only in an asynchronous, anonymous and 
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localised mode for inter-agent cooperation or resource access. Based on the above features of 
mobile agents, we have defined the following three types of autonomies in the following subsection. 


2.1 Definitions of autonomies 
Definition 1: Mobile autonomy is the capability of self-navigation of mobile agents through the 
underlying networks. 


The mobile autonomy of mobile agents should be inherent. Mobile agents can be capable of 
making decisions about their destinations, navigation modes and where to dock when a destination 
is not reachable. The navigation modes could be serial or in parallel. Normally, a mobile agent com- 
pletes a distributed task by visiting a series of network nodes serially and processing queries and 
updating information locally. However, a mobile agent can clone itself and assign a subtask to each 
of it’s duplicates for executing a distributed task concurrently when multiple resources are available 
and a distributed task can be divided into concurrently executing subtasks. The solutions of subtasks 
from different mobile agents are combined into a result finally. A mobile agent should also know 
where to dock temporarily in order to avoid getting lost when a network connection is broken or 
unavailable, such as the disconnection of a laptop computer from the network. So autonomous 
mobile agents are dynamically network-aware entities on which Web applications are constructed. 


Definition 2: Computational autonomy is that a mobile agent can get enough computational 
functions for accomplishing a distributed task. 


A mobile agent should make use of all kinds of computational resources. A mobile agent can 
invoke functions in different processes or load functions into its process for execution when those 
functions reside at its visiting network nodes. A mobile agent can also complete a special 
computational task by executing its own functions carried by the agent when visiting a network 
node with desirable resources but without necessary computational functions. 


Definition 3: Communicational autonomy is that a mobile agent can send and receive messages in 
an asynchronous, anonymous and indirect way. 


2.2 Discussion and classification of interactions 
However, no matter what degree of autonomy they have, mobile agents need inter-agent 
interactions for cooperation and agent-hosting execution environment interactions for resource 
accesses because any autonomous mobile agent exists in a community, which may be composed of 
other agents and heterogeneous hosting environments. Besides the interactions with their owners at 
their home machines, autonomous mobile agents need at least two kinds of interactions during their 
lifecycles. 

1. Inter-agent interactions are needed when a Web application consists of more than one mobile 
agent. Let’s consider the following scenario. When a mobile agent realises that its task can be 
concurrently executed, for example, a mobile agent finds other n interested links in a HTML 
page while retrieving contents of the page, the mobile agent will duplicate n-/ instances with 
respective retrieval tasks. The agent and its n-/ duplicates will transport to n interesting sites for 
further information retrievals. Jnter-agent interactions occur when results from different agents 
are synthesised into a final result for submission to the agent’s owner at the home machine. 

2. Agent-hosting execution environment interactions are needed when a mobile agent accesses 
resources and services in a local execution environment of network nodes it visits. All the 
information in a hosting execution environment must be structured for mobile agent accessing, 
and protocols for retrieving or updating information must be carefully defined. 
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There are new features in interactions among autonomous mobile agents or between 
autonomous mobile agents and hosting execution environments. In the context of autonomy, inter- 
actions should be fully asynchronous, anonymous and localised between communication partners. 
Firstly, the asynchronism does not require a strict agreement on communication time while the 
synchronism implies that any interaction between communication partners depends on complex 
protocols and synchronous mechanisms. The code load required for mobile agents to deal with their 
protocol, and synchronous mechanism is too high. Mobile agents would have too many code loads 
for the protocol and the synchronous mechanism accomplishments. Secondly, if an autonomous 
mobile agent must know the communication partners’ name for interaction, naming services would 
be involved in agent communications. The naming services result in extra network (not local) inter- 
actions between mobile agents and naming service agencies. Too many interactions might weaken 
the autonomy we pursue in many Web applications. Also, it is not necessary for an autonomous 
mobile agent to know other agents’ names because it is enough for an agent to interact with others 
as long as it can identify that its communication partners and itself belong to the same application. 
Finally, any interaction should be localised. An important advantage of mobile agents over other 
distributed technologies is the locality of distributed information processing. Locality leads to 
advantages in terms of high efficiency, reliability and flexibility. So interactions between 
autonomous mobile agents and their communication partners should not introduce any extra 
network communication. 

In the concept of wireless networks (low band) and mobile computation (partially connected 
devices), minimisation of network communications and localisation of interactions are the desirable 
goals of MAT. To achieve those aims, we cannot use the traditional communication models, such as 
the direct communication model and the rendezvous model (Acharya, Ranganathan and Saltz, 1996; 
Baumann, Hohl, and Radouniklis, 1997; Gray 1996), or the meeting-oriented model (Magic, 1998; 
Peine and Stolpmann, 1997). Instead, we design a new model: ASS (Active State Space), which is a 
third party to facilitate interactions between autonomous mobile agents and their communication 
partners for cooperation or resource accesses. 


3 HOW TO SUPPORT AUTONOMIES: METHODOLOGY OF MAT 
In this section, we demonstrate how to introduce desirable autonomies into mobile agents by using 
the virtual autonomy-supporting environment and the autonomy-supporting work mode. 


3.1 Abstract description to heterogeneous execution environments in MAT 

Mobile agents exist in heterogeneous networks such as the Internet. Resources are formatted and 

managed with different modes. Accessing different system resources directly by mobile agents 

results in the following limitations: 

1. Increasing complexity of the application-level programming. It is evident that programmers 
must write different code for accessing different type or the same type of resources. For 
example, an html file is described by path /user/wli/document/index.html in UNIX and described 
by path c:\wli\document\index.html in Windows. In order to access resources in heterogeneous 
networks, programmers should be familiar with different networks and operating systems. 

2. Increasing code loads carried by mobile agents. The more heterogeneous the network that a 
mobile agent travels in, the heavier the code load that a mobile agent carries. A mobile agent 
with a too heavy code load is not a compact entity suitable for Web applications in the context 
of minimising network communications to deal with low band and unreliable networks such as 
the wireless network. 
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3. Decreasing reusability of agent codes. Any effort to extend a mobile agent-based application 
will result in the modification and addition of agent code. 

In MAT, we use Active State Space (ASS) to provide an integrated description of heterogeneous 
resources that are accessed by mobile agents. ASS abstracts resources of different systems into data 
integrity so as to deal with the heterogeneity and the dynamism of the hosting execution 
environments. Mobile agents use virtual resources to access the real resources. ASS is responsible 
for mapping between abstract resources and real resources. ASS provides mobile agents with 
platform-independent resource accesses. In this point, ASS is quite similar to JVM, which provides 
platform-independent Java instructions to programmers. Another motivation for use of virtual 
resources is to provide a secure interface between mobile agents and system resources. It is 
necessary for ASS to perform security checks on some sensitive operations such as deleting files 
because a malicious or badly programmed agent may possibly contain dangerous code. Comparison 
of direct resource access to virtual resource access is graphically illustrated in Figure 1. 
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Figure 1: Comparison of direct resource access to virtual resource access 


3.2 Asynchronous and localised interactions between communication partners in MAT 
Asynchronous, anonymous and localised interactions are main features that must be satisfied in the 
context of Web applications based on autonomous mobile agents. In MAT, we use Resource Space, 
a component of ASS, to coordinate actions of mobile agents for cooperation in a task 
accomplishment. We use the following scenario to demonstrate the asynchronous, anonymous and 
localised coordination model for inter-agent interactions. 

When a mobile agent PA knows that its distributed task DT can be executed concurrently on 
destinations D;, Do, ...... , D,-1, and Dn., the agent PA (called Parent) duplicates n-/ instances SAj, 
DADs. caxaiee , and SA,-; with respective subtasks S77, ST», ...... , and S7,,.;. Each duplicate will 
transport to D;, Do, ...... , and D,,.; respectively, and finally, the parent agent PA will transport to 
the destination D,, with subtask ST,,. Subtasks S77, ST», ...... , ST,,.; and ST, can be the same or 
different. If our aim is to exploit distributed resources for executing distributed tasks for fast 
response, the subtasks should be same, such as, getting the same program table from different TV 
web sites. If our aim is to exploit distributed resources for retrieving different information, the 
subtasks should be different, such as, getting different price table of the same travel from different 
travel agencies. Because SA;, SAp, ...... , SAn-7, and PA concurrently execute subtasks S77, S72, 
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ee , ST;,-1, and ST, divided from DT, the results from different mobile agents need to be combined 
into a final result at a network node Site on which agents agree. Results of subtasks $77, ST>, ST,,_7, 
and ST, from different agents can be described by the following Atomic Record type of Active State 
Space. 


SAR=<String Results, Integer ApplicationID, Integer TaskID> 


lf S77, ST, ....: , 97;,.; and ST, are the same, agents SA 7, SAp, ...... , SA,.7, and PA will compete 
for being alive at the same network node Site. The agent, which arrives at Site first, is regarded as 
the first one to accomplish this special subtask. This agent will be continuously alive, and others 
will exit their execution. It can be identified by the following Atomic Record as the Label in 
Resource Space of ASS whether an agent just transporting to Site is the first one to accomplish the 
special task. 
LAR=<String State, Integer ApplicationID, Integer TaskID> 


If a mobile agent cannot find the LAR in the current ASS, it is the first one to accomplish the 
special task identified by ApplicationID and TaskID, then it will put a LAR into the ASS. If a mobile 
agent finds the LAR in ASS, it is not the first one to accomplish the special task and will exit its 
execution. 

If S77, ST, ...... , ST,.; and ST, are different, all the results of subtasks ST7, ST», ...... es ee 
and ST, from different agents SA;, SA, ...... , SA,-7, and PA need to be combined into a final result. 
The parent agent PA is responsible for the solution synthesis. The duplicated agents SA7, SAg, ...... 
and SA,-_; will exit their executions after putting their results into the current Resource Space of ASS 
at Site. The inter-agent interactions of the above scenario are graphically illustrated in Figure 2. 
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Figure 2: Inter-agent interactions by Resource Space of ASS 
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3.3 Working mode of workflow for mobile agents in MAT 

An autonomous mobile agent should be able to exhibit goal-directed behaviours, which depends on 

a dynamically planned, continuous and autonomous workflow. In order to provide mobile agents 

with an autonomous workflow, we must solve two problems: 

1. Maintaining the persistence of an agent’s thread. This means the whole mapping (including the 
execution point of the thread) of an agent object must be restored after the agent migration. The 
weak mobility model, which is widely used in many mobile agent systems, such as Odyssey 
(Magic, 1998), Voyager (Object Space), Java-To-Go (Li and Messerschmitt, 1996), Aglets 
(Lange and Oshmina, 1998), Facile (Thomsen, Leth and Prasad, 1993), Tocoma (Johansen, 
Reuese and Schneider, 1995), Mole (Object Space) and Grasshopper (IVK, 1999), is not suitable 
for supporting the persistence of agents’ threads. An agent’s execution can only start from the 
beginning or from a special method rather than the stopped point after agent migration, so the 
persistence of an agent’s thread cannot be maintained by weak mobility models. On the other 
hand, the strong mobility supports the persistence of agents’ threads by transparently capturing 
the mapping of an agent’s thread when an agent wants to migrate, transporting the captured 
mapping and restoring the transported mapping after the agent migration. But the strong 
mobility is often constructed on special languages. Systems based on special or specification- 
modified languages, such as Telescript (White, 1997), Agent Tcl (Gray, 1996), Ara (Peine and 
Stolpmann, 1997) and Sumatra (Acharya, Ranganathan and Saltz, 1996), are accepted by 
multiple platforms with difficulty. An evident example is that General Magic rewrites its mobile 
agent system, Telescript, into Odyssey by using Java in order to be widely accepted by 
heterogeneous platforms. 

2. Permitting the modification of its behaviours when a mobile agent transports under the 
computational networks. This means a mobile agent cannot only execute a predefined workflow, 
but also modify or reconstruct its workflow in order to deal with the heterogeneity and 
dynamism of networks. Behaviours of an agent are exhibited by its code. Dynamic 
reconstruction of a workflow means that the agent kernel and its behaviours, which are defined 
by a user’s objectives, should be loosely coupled and cannot be programmed in the same 
program unit because an agent needs to modify its code for adapting to dynamic networks. 
Limitations of current technologies of agent migration necessitate the design of a new agent 

system model, MAT, for supporting autonomous mobile agents. In MAT, we program mobile agents 

with Mobile Thread Programming Model (MTPM). Contrary to many strong mobility technologies, 

MTPM is an application-level rather than a system-level model, that simulates the persistence of 

threads after agent migrations. An advantage of MTPM is that it is highly efficient. MTPM can 

simulate the persistence of an agent’s thread without capturing, transporting and restoring the thread 
stacks and heaps with huge system-dependent information. MTPM does not need any modification 
to JVM specifications, and it uses two new mechanisms, Serialisation and Remote Method 

Invocation (RMI), provided by Java 1.1 and later releases, to provide persistence to an agent’s 

thread. MTPM is outlined in Subsection 4.2. When it is created, a mobile agent plans its Distributed 

Task Plan (DTP) that is the implementation model of MTPM. An execution of a DTP generates a 

continuous and autonomous workflow for a mobile agent. When executing a DTP, a mobile agent 

can transport in networks, retrieve resources at remote hosts and asynchronously interact with other 
agents. At any time when current DTP fails, a mobile agent can replan a new DTP according to 
dynamic networks for an alternative accomplishment of a user’s objectives. A DTP is a reusable 
template that is a static description of a distributed task for the execution by a mobile agent. An 
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execution of a DTP is an autonomous workflow for the agent. Whenever a mobile agent needs to 
modify its behaviours for adoption of the dynamism of networks, it can reach the goal by replanning 
its DTP. DTP is outlined in Subsection 4.2 and detailed in (Li and Zhang, 1999). The work mode of 
mobile agents in MAT is graphically illustrated in Figure 3. 
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Figure 3: The planning, execution and replaning of DTP 


4 IMPLEMENTING AUTONOMIES: MECHANISM OF MAT 

4.1 ASS: distributed supporting environment 

There are three motivations for the design of ASS. The first one is to provide communication 
facilities for autonomous mobile agents. Interactions among agents, by means of ASS, are not 
involved in any network communication and naming service. Interactions among agents are fully 
asynchronous, anonymous and localised. That means a mobile agent can put messages (called 
Record objects) in ASS, which another agent can get later. The second one is to provide a uniform 
description of heterogeneous resources that are accessed by mobile agents. ASS abstracts the 
resources of different systems into the data integrity so as to deal with heterogeneity and dynamism 
of the hosting execution environments. Mobile agents use virtual resources for accessing the real 
resources. ASS is responsible for mapping between abstract resources and real resources. The last 
motivation is to provide a secure interface between mobile agents and system resources. It is 
necessary for ASS to perform security checks on some sensitive operations such as deleting files 
because a malicious or bad-programmed agent possibly contains dangerous code. 


4.1.1 Architecture of ASS 
ASS is composed of Resource Space, Security Space and Operation Space. Virtual resources or 
messages for inter-agent communications are stored in the Resource Space. Any requirement to 
access Resource Space submitted to ASS by a mobile agent will fire an Operation object in the 
Operation Space. For the acceptance or rejection of a mobile agent’s requirements, the Operation 
object needs to match the requirements with Security Rules in Security Space. This interaction 
procedure between mobile agents and hosting execution environments is called Reflection of the 
Operation Object in ASS. So every Operation object has a segment of Reflection codes. A success- 
ful reflection will change the states of ASS, and a failed reflection yields no influence on the ASS. 
In an ASS, any component, such as a virtual resource, a security rule or an operation, can be 
installed and deleted dynamically by the ASS itself or mobile agents. The components, installed by 
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the ASS itself, are the common resources or protocols for all the agents and usually exist for a long 
period. The components, installed by agents, are the special resources or protocols for an 
application and usually exist for a short period. Components in the above three spaces are integrated 
into abstract description called an Active Record as defined below. 


Definition 4: A Record in an ASS is composed of objects that are called sub-records, 1.e. 
<Record>::=<{subrec,}*>, where subrec denotes any sub-records which compose the Record. 


Definition 5: A Record is an Atomic Record if its every sub-record is not a Record object. 
Definition 6: A Record is a Coupled Record if at least one of its sub-records is a Record object. 


A file in a local file system can be structured with an Atomic Record object with the type of 
< String path, String Type, Date UpdateTime, Username owner>. Then the Atomic Record object of 
<’/user/wli/document/’, ‘html’, 2/22/99, wli@cowan.edu.au> represents system file resources: all 
the Atml files that are located in the directory of user/wli/document/, owned by user 
wli@cowan.edu.au and updated in February 22, 1999. 

A message, which a mobile agent puts in an ASS for communications with other agents, can be 
structured with an Atomic Record with the type of <String Results, Integer ApplicationID, Integer 
TaskID>. Every application has a unique identification ApplicationID, and every distributed task of 
the same application has a unique identification TaskID. Any agent can get the results of a 
distributed task by providing the ApplicationID and TaskID to the mechanism of the resource 
matching of the ASS. 

A security rule of ASS can be structured with a Coupled Record object with the type of </nteger 
AgentID, String Operation, AtomicRecord Rec>. Then the Coupled Record object of <*, ‘r’, Rec;> 
represents that any agent can read Record Rec;. Another Coupled Record object of <agentid, ‘rdu’, 
Rec ]> represents that mobile agent identified by agentid can read, delete and update Record Rec}. 

Each Application ID, Task ID or Agent ID is 128 bits, and we use a special algorithm to generate 
IDs. Our ID algorithm is very similar to the algorithm of Microsoft for generating COM GUID 
(Microsoft). Microsoft demonstrated that the algorithm could generate different IDs in 3240 years 
at the generating speed of 1000000 IDs per second. 

An Operation object of an ASS can be structured with an Atomic Record object with the type of 
<String OperationName, Operation OpObj, Username Owner>. Where the OperationName is the 
name of the Operation, OpObj is the instance of the Operation and the Owner is the entity (agent 
or local system) to install the Operation. Every Operation object has a reflection method. The 
reflection method is executed when the Operation object is called by ASS. 

A retrieval requirement from a mobile agent to a hosting execution environment can be 
structured with a Coupled Record object with the type of </nteger AgentID, String Operation, 
Boolean Synchronisation, AtomicRecord Rec>. Then the Coupled Record object of <agentid, ‘r’, 
true, <’/user/wli/document/’, ‘html’, *, *>> represents that a mobile agent, identified by agentid, 
wants to read all the Atm files in the directory of /user/wli/document/ in a synchronous mode. If the 
resource matching and the security inspection succeed, the ASS will return a vector of references to 
the required him files. 


4.1.2 The reflection mechanism of ASS 

ASS has three Object Spaces, which are Resource Space, Security Space, and Operation Space. The 
Resource Space stores Atomic Records. These records abstractly describe system resources of the 
hosting execution environments or messages that mobile agents put for asynchronous interactions. 
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The Security Space stores Coupled Records. These records describe security rules that determine 
which agents can do what kinds of operations on which records in the Resource Space. The 
Operation Space stores Atomic Records. These records describe Operation objects that are 
identified by names and are called by the ASS when mobile agents interact with the ASS. Operation 
objects can be installed by an ASS system or by mobile agents for some special reflection operations 
to interaction events. 


Every Agent Server hosts an ASS object. When a mobile agent transports into a network node, 
the agent gets the reference to an ASS object from the Agent Server. An important feature of an ASS 
is that Operation Records, which are responding entities to interaction events, can be programmed, 
and can be installed dynamically by agents or an ASS itself. The whole procedure of an interaction 
event from the submission of an interaction requirement to the result return is graphically illustrated 
in Figure 4. 
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Figure 4: The reflection procedure of ASS object to an interaction requirement from a mobile agent 


Every interaction requirement can be represented by a Coupled Record: <AgentID, Operation, 
Syn, Rec>, which means: a mobile agent identified by AgentID hopes to execute an Operation on 
all the Records matching the template record Rec in a synchronous mode Syn. The agent provides 
the above requirement to an ASS object with the submit method of the ASS object to which the agent 
has a reference. The reflection procedure of an ASS to the interaction requirement goes through the 
following steps: 

1. The ASS object searches the whole Resource Space to identify records that match Rec, and uses 
these records to construct a vector VecRec. 
2. The ASS object searches the whole Operation Space to identify a record that matches 
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Operation; generating a thread for the matched operation object OpObj, and starting the 

execution of the thread. 

3. The Operation thread constructs a record SR=<AgentID, R> for each record R in vector VecRec 
and searches Security Space to identify the presence of a record that matches SR. 
4. If a matched record is found in Security Space, OpObj will execute the required Operation on 

R in the synchronous mode Syn. 

5. The results of the Operation can be returned to the agent or stored into Resource Space, Security 

Space or Operation Space. 

Some ideas for the ASS design come from Jada (Ciancarini and Rossi, 1997). In Jada’s space, 
objects are passive. ASS takes Jada’s concepts one step further by allowing components of ASS to 
actively respond to interaction events and by allowing components to be installed, updated and 
deleted dynamically by the ASS itself or mobile agents. 


4.2 Workflow template DTP for generating a continuous and autonomous workflow 

In this subsection, we introduce the Distributed Task Plan (DTP), which is a flexible 

implementation model of MTPM for generating continuous and autonomous workflows for mobile 

agents. MTPM simulates the state persistence of an agent’s thread by Serialisation and RMI without 
introducing any new spatial and time complexity. DTP is a workflow template that uses the 

programming paradigm of MTPM (Zhang and Li, 1999). 

In order to obtain desirable autonomies for mobile agents, we firstly provide enough 
programming components and autonomous primitives, which are used to construct a DTP and 
further to provide mobile agents with autonomies in the navigation and the computation. In MAT, 
four kinds of autonomous primitives are defined for DTP designing. 

1. Mobile primitives define the mobility of an agent. A mobile agent can merely transport itself to 
the next destination from the current network node by calling the simple migration primitive, or 
clone and transport each of its duplicates to different destinations by calling the multiple 
migration primitive. 

2. Computational primitives define invocations of computational resources. A computational 
primitive specifies where to find the current computational procedure, how to load it and how 
to execute it. By using the computational primitives, a mobile agent realises that (a) the current 
computational procedure is carried by the agent or is resident at the visiting node; (b) the 
procedure should be started in a different process or loaded into its own process; and (c) how to 
run the computational procedure, e.g. synchronously or asynchronously. 

3. Solution synthesis primitives define the combination of multiple solutions from different mobile 
agents. The solution synthesis is needed when a task is divided into several subtasks and 
executed by different mobile agents concurrently. It is highly efficient to divide a task into 
several subtasks and assign these subtasks to different mobile agents for execution when the task 
can be executed concurrently and multiple resources are available. 

4. Control primitives define the execution flow of mobile primitives, computational primitives and 
solution synthesis primitives. Enough control structures of control primitives are needed to 
efficiently coordinate the executions of all the above three primitives. 

Having defined primitives, we provide a reasonable model: DTP, which is used to depict 
distributed tasks for mobile agents by advantages of those pre-defined autonomous primitives. 
Normally, DTP is composed of all the four kinds of autonomous primitives. 

A DTP consists of primitives, which are arranged into two lists. A list, called a control queue 
(CQ), only contains control primitives, and the other list, called a reusable primitive list (RPL), 
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contains any primitives except control primitives. The architecture of DTP is graphically illustrated 
in Figure 5 (concrete meanings of primitives of Figure 5 are defined in (Li and Zhang, 1999) ). 
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Figure 5: The architecture of DTP 


When being created, a mobile agent plans its own DTP for the execution of a distributed task 
satisfying user’s requirements. The planning procedure includes Objective Matching, Primitive 
Selection and DTP Generation by using users’ requirements, network state information and task 
features. A mobile agent can also replan its DTP when the current DTP fails during execution. The 
planning of DTP is detailed in the next subsection. 

A mobile agent has a reference to its DTP, and a DTP has a reference to a CQ. All the objects, 
from the agent itself, DTP to autonomous primitives, have a run() method. Every objects’ run() 
method just calls the run() method of another object to which the former has a reference. The 
autonomous workflow of a mobile agent is generated when the mobile agent executes its DTP by 
running its run() method as shown in Figure 6. 
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Figure 6: The continuous and autonomous workflow of agent by DTP 


The order of primitives in a CQ is important. The control primitives in a CQ are executed 
sequentially. A control primitive in a CQ has one or more references to primitives in a RPL 
corresponding to which type the control primitive is. A reference to a control primitive in a CQ 
depicts a possible invocation to a primitive in a RPL. The order of primitives in a RPL is not 
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important because the invocations to them are determined only by references of control primitives 
in a CQ (see Figure 5). A RPL is just a repository of autonomous primitives that a mobile agent may 
need to execute when transporting in the underlying networks. So a mobile agent only executes 
control primitives in a CQ one by one, then further executes primitives in a RPL. An execution of 
a DTP is a continuous and autonomous workflow of a mobile agent. Constructing a complex 
workflow by using DTP provides a mobile agent with autonomy, flexibility and reusability in 
distributed applications. 


4.3 Planning and replaning of DTP 

According to a user’s objectives, a DTP object should be planned either by a mobile agent or a static 
agent. In order to make mobile agents lightweight in code load and to support complex planning 
strategies, we move the planning algorithm of DTP (Li and Zhang, 1999) from the mobile agent in 
a previous version of MAT into the static agent (called Planner) into our current version. When a 
mobile agent is created, it gets a task from its owner. The task is described by the following Atomic 
Record. 


OAR=<String TaskLable, String Objectives, Integer ApplicationID, Integer TaskID> 
With following codes, the agent puts its OAR into a current ASS, and later gets its DTP object 


from the current ASS. The received DTP object is matched to the user’s objectives described by 
OAR. Once a mobile agent gets its DTP object, the agent executes it. 


ASS.submit(new CoupledRecord(this.AgentID, ‘Put’, true, OAR)); 
AtomicRecord DTPID=new(*, OAR.Objectives, OAR.ApplicationID, OAR.TsakID); 
AtomicRecord DTPAR; 


while ((DTPAR= ASS.submit(new CoupledRecord(this.AgentID, ‘Get’, true, DTPID)))==null) 
{ Thread.sleep(time); }; 

DTP DTPObject=DTPAR.DTPObject; 

DTPObject.run(); 





When the agent fails to execute its current DTP object due to the dynamism of networks, it also 
uses the above codes to get a new DTP object, which is replanned by a Planner at its visiting 
network nodes, from the current ASS. The new DTP is an alternative workflow for the mobile agent 
to accomplish its owner's objectives in dynamic networks. 

There is one Planner running on every distributed ASS. A Planner continuously gets users' 
objectives (described by OAR) from the current ASS, generates DTP objects and also puts them into 
the ASS using the following codes. 


AtomicRecord OARTemplate=new('task', *, *, *); 
while (true) { 
AtomicRecord OARObject=ASS.submit(new CoupledRecord(this.AgentID, 'Get', true, OARTemplate)); 
if (OARObject!=null) { 


DTP DTPObject=planning(OARObject.Objectives); 
AtomicRecord DTPAR=new AtomicRecord(DTPObject,OAR.Objectives, OAR.ApplicationID, OAR.TaskID); 
ASS.submit(new CoupledRecord(this.AgentID, 'Put', true, DTPAR)); } 

else Thread.sleep(time); } 





222 Journal of Research and Practice in Information Technology, Vol. 33, No. 3, August 2001 











The procedure of DTP planning, running and replanning is graphically illustrated in Fig. 7. 





System / System 1+ 


Figure 7: Planning, running and replanning of DTP 


The Planner matches a user’s objectives with entries in Objective Matching Base (OMB) to 
decide which primitives are used. The selected primitives are arranged into a CQ or a RPL 
according to their types. The references of primitives in the CQ to primitives in the RPL are set up 
at the same time. The developers of MAT applications define the OMB. An OMB consists of 
reusable entries. The matching method uses the algorithm that we prefer to call Requirement 
Propagation to match a user’s objectives with pre-defined entries, which determine DTP planning. 
The Planner also retrieves a Network State Base (NSB) for information about current networks. A 
NSB is periodically updated by a MAT information server in order to reflect the dynamic networks. 
A concrete example of DTP planning is graphically illustrated in Figure 8. In this example a user 
wants to get a file ProgramTable from remote network node butterfly when the file is updated after 
February 22, 1999. 


Objective: Get file ProgramTable, Destination: butterfly and Condition: Updated after 2/22/99 
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Figure 8: DTP planing: objectives matching with reusable entries in OMB 
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Each entry of OMB has four components, an entry name with parameters, Process, PreProcess 
and PostProcess. Process is executed after the execution of PreProcess and before the execution of 
PostProcess. The executions of Process, PreProcess or PostProcess can possibly access other 
entries, so the objective matching is the Requirement Propagation by reusing entries in OMB. In the 
above example, the objective matching is in the order of 1-2-3-4-5-6 and the DTP planning is in the 
order of (a)-(b)-(c)-(d). The workflow generated by executing the planned DTP will be: 


Step I: Transporting to butterfly, 

Step 2: Monitoring file ProgramTable until it is updated after February 22, 1999, 
Step 3: Reading the file, and 

Step 4: Taking the file back the home machine. 


The replanning of DTP is the same as the planing of DTP but the Planner uses the OMB and the 
NSB in the current node rather than those in the home machine of a mobile agent. The replanning 
of DTP gives a mobile agent an alternative workflow to accomplish its owner’s objectives when 
current DTP fails. 


5 CONCLUSION 
Partially connected computers, such as laptops, personal digital assistants or modem-connected 
home computers, are involved with the development of the network, especially the wireless 
network. Applications launched from them are usually called mobile computation, which depends 
on the autonomies of mobile agents. MAT is a mobile agent system, which is embedded with high 
autonomies. As we know, many mobile agent systems, such as Telescript, Voyager and Aglets, do 
not or do not highly support autonomy of agents because their goal is not to support the novel Web 
application of mobile computation. Only the mobile agent system Agent Tcl pays some attentions 
to mobile computation. But none of them proposed the autonomy theory for the mobile agents. 
However, mobile computation will become a popular Web application with the development of 
wireless networks and more and more mobile devices, such as laptop computers and mobile phones. 
The feature, which distinguishes MAT from other agent systems, is that MAT concentrates on 
autonomies in the context of mobile agents. For supporting both mobility and autonomy, many 
concepts, paradigms and models, have been proposed for MAT implementation, which mainly 
include the autonomies of mobile agents, MTPM, DTP and ASS. All the proposed models are 
implemented currently. In MAT, a mobile agent is an autonomous entity that can plan its workflow 
and roam in networks for resource accessing or cooperation with other agents in asynchronous, 
anonymous and localized modes. By using MAT, a developer can use all kinds of templates, 
including DTP, Active Record and Object Matching Rule, to construct Web applications. A user of 
MAT applications only needs to provide his objectives and other necessary information such as 
features of tasks (concurrent or serial). Mobile agents can determine goal-directed work and 
cooperation modes according to users’ requirements, and try alternatives when an effort fails. 
MAT is a new prototype system implemented in Java that concentrates on providing agents with 
both mobility and autonomy. This paper summed up its current methodologies and implementation. 
A web application titled Intelligent Retrieval for the Information on the Web was constructed on 
MAT. This application supports the use of mobile devices. Another application titled Distributed 
Swarm is under construction. This application enables the Swarm (Swarm Design Group) into a 
distributed system. Our future work will focus on investigating the suitability and the effectiveness 
of MAT in Web applications, such as Internet information retrieval, electronic commerce and 
Computer Support Cooperative Work (CSCW). From feedback of the investigations, we may find 
problems in MAT, and make improvement to the methodology, model and their implementations. 
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This paper describes a new mechanism for meta-data update in file systems using 
flash memory. To fulfill flash memory as a stable storage system, one of its main 
draw-backs, meta-data update, which occupies the majority of write operations, 
must be reduced. That is, the file system should avoid weaknesses of flash memory, 
such as a limited number of program/erase cycles and its inability of in-place 
updates. We exploit two types of i-nodes which are allocated in separate flash areas 
and also introduce a delta logging method to efficiently utilise flash memory space. 
A trace-driven simulation shows the results that the performance of our system with 
two workloads is improved in terms of both time and space utilisation. 
Keywords: Operating Systems, Flash Memory, File System, Meta-data 


1 INTRODUCTION 

Traditional computer systems construct file systems using magnetic disks as secondary storage. 
However, magnetic disks are not recommended in the mobile computing environment because they 
consume lots of power and cannot guarantee reasonable resistance to shocks and vibration. Flash 
memory, a new type of solid state memory, has been replacing magnetic disks as a persistent storage 
system in the mobile environment since it provides useful characteristics, such as non-volatility, in- 
system programmability, low power consumption, reasonable tolerance, and high density (Douglis, 
Caceres, Kaashoek, Li and Marsh, 1994; Kawaguchi, Nishioka and Motoda, 1995; Wu and 
Zwaenepoel, 1994). 

On the other hand, flash memory also has drawbacks: erase-before-rewritten, write access times 
higher than read access times, and a limited number of program/erase cycles. 

Accordingly, there are several obstacles to overcome when implementing a file system on flash 
memory rather than magnetic disks. Conventional file systems based on magnetic disks have an 
inherently hot region for meta-data manipulation (e.g., a FAT in MS-DOS or super-blocks/i-nodes 
in UNIX file system). If they are implemented on flash memory, their own in-place update schemes 
do not make sense at all. The erase-before-rewritten constraint of flash memory makes it difficult 
to update new information in its place. Previous studies indicate that the majority of writes are for 
meta-data and emphasise that meta-data update is important to file system performance (Ganger 
and Patt, 1994; Ruemmler and Wilkes, 1993; Vahalia, Gray and Ting, 1995). 
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Meta-data updates, which frequently occur during file manipulation, impose overheads in spatial 
cost on the system since the overwritten region of the flash memory is forbidden. Furthermore, with 
UNIX-like file systems implemented on flash memory, the read operation of a file invalidates a 
whole i-node block for the file, and a new flash block for the i-node must be allocated because the 
file system updates the record of the last access time stamp for the file. Even bit-level meta-data 
updates require write operations on the entire data block. 

In this paper, we propose a mobile file system with flash memory (MFWEM) as a substitute for 
Hard Disk Drives (HDDs). The primary approach of MFWFM is to efficiently reduce the problems 
associated with meta-data updates while avoiding flash memory's drawbacks as mentioned. Our 
MFWFM exploits a new organisation of a file system, offering the concept of in-place updates on 
flash memory and devises a delta logging method to reduce the amount of flash memory spaces 
incurred by frequent occurrences of meta-data updates. What is more, MFWFM separates an 
existing i-node into two types of i-nodes in order to increase the throughput of write operations and 
to improve system performance, presenting parallelism of write operations to separated areas. 

The performance of proposed MFWFW is evaluated by a trace-driven simulation with realistic 
traces as well as synthetic traces. 

The remainder of this paper is organised as follows: Section 2 describes the MFWFM and 
various techniques to support it. Section 3 evaluates MFMWM using trace-driven simulations, 
presents the results, and analyses them. Section 4 reviews several issues related to our MFWFM. 
Section 5 summarises our results and presents our conclusion. 


2 MFWFM ARCHITECTURE 

2.1 Overview 

MFWFM architecture is logically composed of two layers: the upper layer and the lower layer. The 
upper layer, based on UNIX System V file system, supports system call interfaces and fetches 
information from lower layer facilities for file manipulations. The lower layer manages flash 
memory directly and is also divided into two sub-layers: DRAM layer and Flash layer. Figure 1 
shows the layout of the lower layer in MFWFM. 


segment pointer 


' sc_inode 





sc_inode 


Figure 1: Layout of MFWFM lower layer 
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For indicating segments of a file, the DRAM layer maintains three types of in-core segment 
pointer structures: fc_inode_segs—ptr, sc_inode—segs—ptr, and data—blk_segs_ptr. When a file is 
initially created, the fc_inode—segs_ptr represents all of its segments. 

After write operations for a file, modified segments are indicated by sc_inode_segs_ptr. The 
data_blk_segs_ptr stands for the list of data block segments. 

Bitmaps within each pointer represents the utilisation of flash memory for free space 
management: free_map and sane—_map. The former allows a fast searching of free blocks within the 
segment while the latter enables the segment manager to choose a victim segment in segment 
management processing. 

The reason for holding the bitmaps in DRAM is due to the drawbacks of the flash memory. As 
we have stated before, flash memory cannot be updated in-place, and bit-wise operation in flash 
memory is unavailable. 

Flash layer consists of three types of on-chip structures: first-chance-inode—segment, 
second—chance—inode-segment, and data_block-segment. All segments of a file can be arbitrarily 
mapped onto any of the three segments. Each segment contains its descriptor which represents 
information such as its segment type, number of erase cycles, and block usage. 


2.2 Separation of i-nodes 

MEWFM provides the concept of separation among i-nodes. Whenever data blocks of segments are 
modified or attributes of a file are changed, MFWFM logs the updates sequentially onto other 1- 
node data structures. Contrary to HDDs, flash memory does not support in-place update. Hence, a 
new i-node is devised for processing the updated information. 
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Once a file is created, a set of first-chance—inode, fc_inode, is defined. First-chance—node is 
identical to i-nodes in conventional UNIX, except it has a pointer to second—chance—inode, 
sc_inode. Second—chance—inode contains only the updated information, delta logging, which 
happens during file manipulations. Whenever applications deal with a file, MFWFM does not 
invalidate the entire block for the file, but only appends appropriate logs of the updated fields of 
fc_inode onto sc_inode. Data-block segment indicates data blocks related to each corresponding 
inodes. Figure 2 shows the relationships among first-chance—inode, second—chance_inode, and data 
blocks. , 

By means of separating the i-nodes, we can obtain parallelism of write operations. The write 
operations for both meta-data and data itself can be initiated sequentially but need not be processed 
synchronously. In other words, the work for data writes does not have to wait for the completion of 
their meta-data. 

The latter is guaranteed by flash memory characteristics irrespective of power failure. Even with 
a power loss, the last modification information remains consistent because both meta-data and pure 
data are logged. 

For the delta logging, each log is composed of log-type and log-info. Log-type is determined 
according to the updated attributes of a file, and the contents of log-info field contain the updated 
attributes. MFWFM provides various types of logs as follows: 

e FC_INODE-LOG-_TYPE : indicates the creation of a file. 


e F_-NAME-LOG_TYPE : indicates the changing name of a file. 
e F_SIZE LOG_TYPE : indicates the changing size of a file. 
¢ OWN_U-LOG_TYPE : indicates the changing user-ID of owner. 
¢ OWN-G-LOG_TYPE : indicates the changing group-ID of owner. 
e CRE-TIME-LOG_TYPE : stands for the i-node creation time. 
e LAC_TIME-LOG_TYPE : stands for the time stamp of last access time. 
¢ MDF_-TIME-LOG_TYPE : stands for the changing of i-node modified time. 
e DBLK—DIRECT-LOG_TYPE 
: indicates the changing of pointer for directly indexed data block. 


e DBLK-SINGLE-LOG_TYPE and INDX-SINGLE-LOG_TYPE 
: indicates the changing of pointer for single indirectly indexed data block. 


¢ DBLK-~DOUBLE-LOG_TYPE and INDX—-DOUBLE-LOG_TYPE 
: indicates the changing of pointer for double indirectly indexed data block. 


¢ DBLK—-TRIPLE-LOG_TYPE and INDX-TRIPLE-LOG_TYPE 
: indicates the changing of pointer for triple indirectly indexed data block. 


2.3 Segment manager 

Segment manager is invoked to reclaim the segments of flash memory when available segments for 
file operations are inadequate. For simplicity, MFWFM adopts a greedy policy, from which the 
segment manager selects the most invalidated segment for reclamation. Once the segment manager 
chooses a segment to be erased as a victim, all valid blocks of the segment are collected and copied 
onto a new segment. A command issued from MFWFM erases the victim segment and the erased 
segment is used for new data. Current flash hardware permits multiple segments to be erased 
simultaneously as long as each segment belongs to difierent banks. 
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2.4 Algorithms for File Operation 

File Open and Close 

The concept of open in MFWFM is to gather updated information from a set of logs in sc_inode 
and to construct complete i-node information. However, if the file being opened is a frequently 
accessed one, such as hot file, it may have too many logs to retrieve since a lot of sc_inodes may 
be attached to fc_inode for logging of 1-node modification. Therefore, MFWEM limits the number 
of loggings for each file so that it gains reasonable latency when opening a file. The total time to 
open a file in MFWFM is shown below: 


Topen = T fetch_fc—inode 25 T wpdate-inode 


T fetch-fc—inode +Niog*!log_type 


a 


Ninfo*"log—infot"fc*"fc_inode 


In the above equation, Tyerch_fc_inode 1S time to fetch fc_inode from flash memory and 
T update—inode 18 the time to update fc_inode along with logs from sc_inode, where tig¢_type 18 the time 
to read one log_type field from sc—inode, tjg-injo 18 the time to read one jg¢-info field from sc_inode, 
and t_ inode 18 the time to read one fc_inode field from sc_inode. Njgg is the number of all logs and 
is calculated by ninggtng , Where Nino is the number of log—info type logs and n 4, is the number of 
fc_inode type logs. 

Be that as it may, the total time to open a file in conventional file systems, such as FMFS, is 
diffierent from the equation shown above. Typdate_inode 18 not used for calculation. As a result, the 
main impact of open is dependent upon the cost of Typdate_inode. From segment manager's 
viewpoint, if the number of logs for each files exceeds the threshold that MFWFM offers, segment 
manager designates the particular segment as a victim for reclamation. While the segment manager 
cleans up the corresponding segment, all logs within the segment are collected and merged into a 
new fc—inode. Actually, even for files that contain a number of logs, the cost required for log 
gathering, Typdate—inode> 18 not critical because the read access time in flash memory is comparable 
to those of DRAM. 

Just as with UNIX, the last access time must be recorded when closing a file. However, time 
stamping in MFWFM has only to append a log which represents the change of the last access time. 
In contrast, conventional file systems invalidate the entire block for the i-node. The time to close a 
file is calculated as follows: 


Telose = T wpdate—incore-inode* 1 update—flash—inode 


T wpdate—incore-inode +l last_logt lappend-log 


where Typdate—incore-inode 18 time to update incore inode within DRAM and Tipydate_flash-inode 18 
time to copy updated information for i-node into flash memory. tigst_jog 1S time to search the last log 
by means of scanning sc—inodes and fgpnend_iog 1S time to write a log into sc_inode. 

In the above formula, tjgs_jog is a function of few nsec and tappend_tog 1S dependent upon the type 
of log. If there is no room for logging in preallocated sc_inode, T, jose requires additional cost which 
is needed for allocating a new sc—inode. Besides, if the number of logs exceeds the threshold, close 
procedure should invoke the segment manager instantly. 
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Update Policy in a Flash Memory File System 
File Create and Delete 

To create a file, MFWEM has to determine whether the corresponding fc_inode is being used or not. 
For initial creation, MFWFM has merely to set up appropriate fields on corresponding fc—inodes. 
Otherwise, MFWFM allocates a new sc—inode on the requested fc—inode, and establish fc_inode 
pointer to the sc_inode. When a given fc_inode is dirtied two or more times, MFWFM may 
probably contain one or more sc—inodes. At best, the overhead for file creation is identical to that 
of FMFS. However, MFWFM requires additional cost for allocating a new sc_inode and appending 
the log in the worst case. 

Delete algorithm in MFWFM is simple. First of all, MFWEM searches for the last fc_inode and 
marks its fc_validity field as INVALID. All indices for data blocks remain as they are when deleting 
the i-nodes but sane_map is marked as zero. Reclamation procedure of segment_manager uses its 
bit vector. 


File Read/Write and Other Operations 

Read algorithm in MFWFM is identical to read operations in traditional UNIX implementation. 
Since in-core i-node is already updated during open operation, read operations use complete 
information from in-core i-node. 

The algorithm for write is more complicated than the traditional UNIX implementation. We 
consider three types of write operations. In the case of modifying an allocated data block, MFWFM 
allocates a buffer cache in DRAM space for update in advance. MFWFM copies original data from 
the flash data block to buffer cache and modifies the corresponding data flock within the buffer 
cache. After the modification of data block, the contents of the buffer cache are written back to the 
flash memory area which is newly allocated. The index for the new data block must be recorded in 
sc_inode. In the case of appending new data to a file, write operations allocate new data blocks. 
MFWFM records their address into appropriate fields in fc_inode. The other case for write 
operation is to modify or truncate the size of a file. To truncate a file by a given size, MFWFM 
invalidates only the appropriate log instead of the entire i-node block and then appends such log to 
last log in sc_inode. 

To position a file pointer to a specified offset, MFWFM need only establish the current file 
pointer in in-core i-node. As stated before, in-core i-node acquires complete information during the 
open operation. As for seek operation, the current file pointer need not be recorded to actual i-node 
in flash memory. To change certain attributes of an i-node, MFWFM have to only append an 
appropriate log to a set of logs. 


Factor Specification 
Flash Memory Size 64 MBytes 
Number of Segment 16 
Segment Size 128 KBytes 
Segment Erase Time 1.2. sec 
Byte Program Time 6 usec 
Page Size 256 Bytes 
Page Program Time 1 msec 
Read Access Time 30 nsec 
Maximum Read Throughput 33.33 MBytes/s 
Maximum Write Throughput 244.1 KBytes/s 
Erase Throughput 106.7 KBytes/s 


Table 1: Flash Memory Component specification based on Intel 28F016XS 
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3 PERFORMANCE EVALUATION AND DISCUSSION 


For simulation, we assume a bus-based simulation architectural model that is composed of a CPU, 
a set of DRAMs, a set of flash memory, and two auxiliary I/O devices. 


3.1 Simulation Methodology 
To study the performance implications, we developed two trace-driven simulators which run on 
CSIM (Schwetman, 1992). 

One of the two implemented simulators is for simulating our designed MFWFM and the other 
is for simulating comparative file system model FMFS (Kawaguchi, et al, 1995). 

Table 1 represents the parameters of flash memory used in the simulation. Of all 64 MBytes of 
MFWFM space, the default portion for fc_inode type segment, sc_inode type segment, and 
data_block type segment is defined by 7%, 3% and 90% respectively. Unfortunately, since sufficient 
space, as with HDDs, cannot be reserved for file system spaces, our simulator cannot allocate actual 
file space for real traces. 

Consequently, we need new synthetic traces and also manage the file descriptor information to 
the full extent in order to evaluate the throughput of read/write operations and to measure the 
amount of invalidated flash memory during realistic file operations. 

The simulators are driven by two kinds of workloads: real traces and synthetic traces. The former 
is the Sprite file system traces which were collected from a real environment (Hartman, 1993). 

Using the Sprite file system traces, we can evaluate the performance of MFWEFM under a 
realistic file system environment. In the case of synthetic traces, file size information was generated 
based on the result of UNIX file size survey (Irlam, 1994). 

We assume the average interarrival time of each event in synthetic traces as 500 psec, while the 
average of inter-arrival time in the Sprite file system traces is about 17.5 msec. 
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Figure 3: Effect of read/write performance based in initial utilisation 
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3.2.1 Simulation Using Synthetic Traces 

The purpose of the first experiment is to test the sequential write performance to measure maximum 
throughput. In these traces, as for file systems filled with fixed initial data, a certain amount of data 
was repeatedly written. The comparison of sequential write performance in FMFS, synchronised 
MFWEFM and asynchronised MFWFM is shown in Figure 3(a). 

The curves show the result based on difierent initial data. Simulation results for FMFS show that | 
as the amount of initial data gradually increases, the throughput decreases. Despite this there is no | 
need to invalidate data blocks for writing, the time to find free data block will gradually increase 
according to the utilisation of the file system. The difierence between two types of MFWFM is to 
tell whether operation to write a data waits for the completion of write to an appropriate meta-data 
or not. The tiny improvement in the asynchronised-MFWFM is derived from reducing the time for 
recording a data block index into i-node. 

In MFWEM, read operations have only to fetch the address of data block with indices. Since log 
for indices is already updated during a file open, there is no need to find free block and to invalidate 
the flash memory. Unlike a HDD-based file system, since MFWFM is free from the overhead of 
seek latency and rotational delay, the results of read operations in traces have uniform values, 
regardless of the policy for block layout or file system utilisation. Figure 3(b) shows that the 
average read throughput is nearly 26 MBytes/s and provides the overall improvement of read 
performance against FMFS. 

This improvement is due to the reduction of remapping time. That is, in the case of MFWFM, 
the address of data block in i-node is the real address of flash data block. On the other hand, the 
block number in i-node should be remapped into references of the translation table in FMFS. 


3.2 Results and Analysis 
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Figure 4: File operation performance under real traces 
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3.2.2 Simulation using Realistic Workloads 

The purpose of simulation using real traces is to observe performance improvement in file 
Operations and to measure the effects of flash memory invalidation. Figure 4 summarises the mean 
service time for each file operation. From the simulation results, we can observe that the cost of 
open operation in MFWFM is four times greater than that of open in FMFS and MFWFM provides 
large variance for open operation. Open in FMFS should reference the translation table to remap the 
address of i-node block into flash memory and it actually requires a copy to read i-node block from 
flash memory. However, MFWFM requires additional cost to update i-node with logs for open 
operation. 

In contrast, the cost of close operation in FMFS is almost seven times higher than that of close 
operation in MFWFM, since close operation in FMFS requires write time for the entire i-node block 
even for modification of just one field, such as last access time. On the other hand, MFWFM need 
only write the appropriate log, LAC-TIME-LOG_TYPE. The costs for read/write operations are 
similar to the results of simulation using synthetic traces. In FMFS, the costs for mk—link and 
set_attr operations have identical values because they require full remapping of i-node block for the 
modification of each field. The cost of get_attr and seek operation represents the same value in all 
cases. 

Table 2 shows the comparison of throughput in read/write operations and the amount of 
invalidated flash memory. The result shows that the amount of invalidated flash memory in 
MFWFM is reduced by 14.75 % against FMFS. 


Model read throughput write throughput _ invalidated Bytes pure data ratio 
FMFS 28.245133 0.155778 998670336 0.939296 
MFWEM 28.924958 0.158302 870261296 0.972458 
MFWFM a) 28.924958 0.158668 870261296 0.972458 


Table 2: Comparison of throughput in read/write operations and amount of invalidated flash memory under real 
traces : unit in throughput is average MBytes/s 


4 Related Work 
There has been numerous work done inanumber of fields related to our interests. Meta-data log- 
ging (Vahalia, et al, 1995) for NFS server and meta-data update (Ganger and Patt, 1994) supporting 
ordered writes aimed to handle crash recovery. They tried to deal with meta-data operation 
efficiently, maintaining the integrity of the file system on-disk structures. Furthermore, IBM's 
System R (Astrahan et al, 1976, Gray et al, 1981) used the transaction log and the shadow-file 
mechanism to facilitate application specific recovery for medium failure. Srinivasan et al (1985) 
designed the shadow-file concept coupled with the check point for recovery in microprocessor 
systems. 

Douglis et al (1994) examined three storage alternatives for mobile computing. They found that 
at a 90 % utilisation or above, of large erasure units might result in performance degradation. 

Caceres et al (1993) emphasised the operating system's role to increase read throughput as well 
as to reduce write performance with battery-backed DRAM. 

eNVy (Wu and Zwaenepoel,1994) presented its storage space as a linear and memory mapped 
array and found it was actually a large non-volatile main memory storage system rather than a file 
system. 
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To overcome flash memory's inability of in-place update and high write cost, segment clean or 
data reclamation was provided in log-structured file systems (LFS) (Robinson, 1996; Rosenblum 
and Ousterhout, 1992; Seltzer, Bostic, Mckusick and Staelin, 1993). 

Kim et al (1999) proposed a new flash memory management based on a log structured file 
system, focusing on lowering the cleaning cost and evenly utilising flash memory cells. With the 
collection operation of cold and non-cold data, they expected to reduce the total cleaning cost 
compared to greedy policy. 

MS-Flash (Intel Corporation, 1995) was provided as a MS-DOS compatible file system using 
flash memory cards. For flash areas that were overwritten frequently, MS-Flash handled them by a 
linked-list data structure. However, even for byte-level modifications, since the whole space for the 
file should be invalidated, the overhead of meta-data update has significantly exerted a negative 
influence on the file system. 

Kawaguchi et al (1995) proposed a flash memory based file system (FMFS) that supports UNIX 
file system transparency and segment cleaning. It was executed by way of LFS to minimise flash- 
memory's write cost. Whenever data blocks were overwritten through write operations, FMFS 
invalidated all of them and issued several commands to reclaim the invalidated data blocks. 
Therefore, whether write operations were originated from user data blocks or meta-data, FMFS 
intrinsically imposed overheads in spatial cost as utilisation of flash memory increased. 

From a hardware point of view, a lot of studies have been done to make the best of flash memory 
in applications such as solid state disks and mobile computing. Khouri et al (2000) proposed a fast 
read word-line and program bit-line voltage regulator for multilevel flash memories. Low power 
structure added to the basic circuit was devised for high recovery speed. 

Mohammad et al (2000) provided algorithms for handling faults specific to flash memories. 
They exploited test algorithms to detect three types of disturbances in flash memories. 


5 Conclusion 

Flash memory is becoming an increasingly important medium of persistent storage in the mobile 
environment. Characteristics, such as high density, non-volatility, and low power consumption, 
flash memory makes it the ideal storage system, replacing the conventional HDD's place in mobile 
computing. However, flash memory should overcome its inability of in-place updates in order to be 
used as a stable storage system. 

In this paper, we proposed a new file system, MFWFM, that provides traditional UNIX system 
call transparency based on flash memory. To minimise the amount of meta-data update as much as 
possible, MFWFM separated conventional i-node into two types of i-nodes: fc_inode and sc—inode. 
Each time when an application updates contents or fields of a file, MFWFM does not invalidate the 
entire block but appends only appropriate logs onto sc_inode. To maintain a sufficient number of 
free blocks, segment manager collects insane segments. Even while update of meta-data is 
progressing, MFWFM allows other applications to update their own data. Our simulation results 
provide evidence that MFWFM outperforms conventional FMFS. 

However, a number of interesting works still remain, such asthe modelling of the likely uses of 
mobile computing environment and the benchmarking for various realistic sized flash memory. 
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With the proliferation of relational database programs for PCs and other platforms, 
many business end-users are creating, maintaining, and querying their own 
databases. More importantly, business end-users use the output of these queries as 
the basis for operational, tactical, and strategic decisions. Inaccurate data reduce 
the expected quality of these decisions. Implementing various input validation 
controls, including higher levels of normalisation, can reduce the number of data 
anomalies entering the databases. Even in well-maintained databases, however, data 
anomalies will still accumulate. To improve the quality of data, databases can be 
queried periodically to locate and correct anomalies. This paper reports the results 
of two experiments that investigated the effects of different data structures on 
business end-users’ abilities to detect data anomalies in a relational database. The 
results demonstrate that both unnormalised and higher levels of normalisation lower 
the effectiveness and efficiency of queries relative to the first normal form. First 
normal form databases appear to provide the most effective and efficient data 
structure for business end-users formulating queries to detect data anomalies. 
(Keywords: End-user queries, Data quality, Normalisation) 


1. INTRODUCTION 

Since the late 1980s, the availability and usability of relational database programs has significantly 
increased, especially for mini- and personal computers. This increase has led to a proliferation of 
databases being created and maintained by business end-users. These databases are often queried to 
provide information for important business decisions. The quality of these decisions depends on the 
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quality of the data, i.e., the value and usefulness of information systems depend on the accuracy of 
their data (Janson, 1988; Redman, 1995; Kaomea and Page, 1997). Empirical studies of corporate 
databases have shown that many organisations’ databases contain anomalies (Morey, 1982; Laudon, 
1986; Icerman and Hillison, 1991; Medawar, 1995). Information technology (IT) professionals 
designed these corporate databases and their controls. Databases designed by business end-users, 
who have limited or no training in relational databases, are likely to contain an even larger 
percentage of anomalies. 

Relational database theory has demonstrated that higher levels of normalisation reduce data 
redundancy (Codd, 1970; Brookshire, 1993; Elmasri and Navathe, 2001; Date, 2000; Date, 1998). 
This reduction in redundancy minimises data anomalies introduced while inserting, updating, or 
deleting data elements or records. Normalisation eliminates repeating groups and resolves lack of 
atomicity. Atomic fields contain only one piece of data, e.g., a single discount field that contains 
both the discount percent and the number of days to which the discount applies violates atomicity. 
The presence of repeating groups and non-atomic data attributes makes queries more complex. 
Unfortunately, decreased data redundancy resulting from higher levels of normalisation leads to an 
increase in data fragmentation, i.e., further normalisation generates larger numbers of tables with a 
smaller number of attributes in each table. This increased fragmentation leads to increases in query 
complexity as more joins and associated subqueries are required. Thus, increasing the level of 
normalisation may adversely affect business end-users’ ability to formulate queries. 

Because of data quality concerns, business end-users may need to query databases to detect 
anomalies. Little empirical research has examined the effects of various levels of normalisation on 
business end-users’ ability to develop such queries. Through two experiments, this study examines 
the ability of business end-users to query a typical business database to locate data anomalies. The 
first experiment examines business end-user queries for three sets of data structures: not normalised 
(=NPF), first normal form (1NF), and third normal form (3NF). These three data structures match the 
two structures often observed in professionally developed databases (3NF), business end-user 
developed databases (~NF), and a viable alternative (INF). The second experiment focuses on 
differences in queries with 1NF and 3NF data structures. 

Typically, business end-users design and create databases by focusing on access to data without 
regard to relational database theory. The resulting data structure may contain multi-valued data 
fields and redundant data, i.e., >NF. Business end-users often do not understand how multi-valued 
data fields and redundant data lead to insert, update, and delete anomalies, i.e., how their database 
designs lead to data quality problems. If they experience difficulty designing a database, business 
end-users can seek the help of IT professionals, but these professionals are likely to design 
databases to minimise data redundancy, e.g., by normalising to 3NF, but at the expense of ease of 
querying. Thus, unaided business end-users produce unnormalised data structures while database 
professionals produce third normal form or higher data structures. Intermediate data structures such 
as INF may, however, be more appropriate for business end-user queries. 


2. EFFECTS OF NORMALISATION ON LOCATING DATA ANOMALIES 

2.1 Data Accuracy and Data Quality 

Ideally, organisations would maintain data quality by preventing anomalies from entering databases 
(Huh et al., 1990). Organisations can reduce anomalies by creating well-designed databases and 
implementing extensive input validation controls. Normalisation of data structures, typically to 
third normal form or beyond, helps minimise the occurrence of insert, update, and delete anomalies. 
Good input validation controls also help minimise the possibility of data inaccuracies occurring 
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within a database. Unfortunately, even well-designed databases with extensive input validation 
controls do not result in completely accurate data. To avoid making decisions based on erroneous 
data, anomalies in databases need to be detected and corrected. One way to locate and correct data 
errors in a database is to query it periodically. As we will show, the underlying structure of the data, 
i.e., the normalisation level, affects the form, content, and complexity of these queries. 


2.2 Normalisation 

Normalisation is an elimination procedure (Codd, 1970) that gives successively more desirable 
levels of structural characteristics (Date, 2000). For a relation to be in first normal form (1NF), no 
attribute in the relation can contain more than one data value (Date, 2000). When a data structure 
moves from unnormalised (=NF) to first normal form (1NF), all repeating fields are eliminated and 
non-atomic data items are decomposed into multiple atomic elements, i.e., all multi-valued 
attributes are resolved. When the level of normalisation increases from 1NF to 3NF, all partial and 
transitive dependencies are eliminated. The removal of repeating fields, non-atomic data items, and 
partial and transitive dependencies increases the fragmentation of the database, i.e., increases the 
number of tables. Other normal forms include Boyce-Codd normal form, fourth normal form, and 
fifth normal form. Because the experiments used only —=NF, 1 NF, and 3NF structures, Boyce-Codd 
(BCNBF), fourth (4NF), and fifth (SNF) normal forms are not discussed here. 


2.3 Relationship Between Normalisation and Query Complexity 
Repeating fields and non-atomic data make queries of unnormalised data structures more complex. 
To retrieve data from data structures containing repeating fields requires numerous AND and OR 
operators in the condition clause of the query, i.e., in the WHERE clause of an SQL query. Non- 
atomic attributes increase query complexity because of the additional string manipulation functions 
required to extract specific data items. Data structures normalised to 1NF reduce the complexity of 
queries by eliminating repeating fields and non-atomic attributes. Although a slight increase in 
fragmentation occurs, compared with a=NF, queries of INF data structures contain fewer terms and 
operators. The following information request with queries for ~NF and INF data structures 
illustrates how multi-valued attributes make queries more complex. 

List the items and corresponding receipt numbers where the payment date is earlier than the 
receipt date plus half the number of term days. 


SELECT itemno, recrepnol, recrepno2 SELECT ITEMB.itemno, recrepno 
FROM ITEMA FROM ITEMB, RECEIPTSB 
WHERE paydatel < WHERE ITEMB.itemno=RECEIPTSB.itemno 


(recdate1+(0.5*(TO_NUMBER AND paydate < (recdate+(0.5*termdays)); 
(SUBSTR(terms,3 INSTR(terms,’/’)+2))))) 

OR paydate2 < (recdate2+(0.5*(TO_NUMBER 
(SUBSTR(terms,3 INSTR(terms,’/’)+2))))); 





Databases normalised to higher normal forms, e.g., 3NF, often contain substantially more tables 
than lower normal forms, e.g., 1 NF. To perform useful queries on these data structures, the data have 
to be accessed through join operations (Beynon-Davies, 1992). Therefore, although they eliminate 
update anomalies, higher normal forms often require additional joins to retrieve the desired data 
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(Date, 1986). The following information request with queries for INF and 3NF data structures 


illustrates this increase in complexity. 
List the items and corresponding receipt numbers where the payment date is earlier than the 


receipt date plus half the number of term days. 


SELECT ITEMB.itemno, recrepno SELECT ITEMC.itemno, RECEIPTSC.recrepno 
FROM ITEMB, RECEIPTSB FROM ITEMC, VENDORC, RECEIPTSC, PAYMENTSC 
WHERE ITEMB.itemno=RECEIPTSB.itemno WHERE ITEMC.vno=VENDORC.vno AND 


AND paydate < (recdate+(0.5*termdays)); RECEIPTSC.itemno=ITEMC.itemno AND 
PAY MENTSC.recrepno=RECEIPTSC.recrepno 
AND paydate < (recdate+(0.5*termdays)); 





As the complexity of the task increases, i.e., formulating queries in 7NF or 3NF rather than INF 
data structures, the resulting cognitive demands on individuals increase (Campbell, 1988). 


3. HYPOTHESES 

3.1 Effectiveness 

Databases in =NF contain non-atomic data items, repeating fields, or both. Data structures in 1NF 
are free of repeating fields and non-atomic attributes. Relative to 1NF, the structural problems 
associated with =NF data structures increase query complexity. Higher query complexity decreases 
business end-users’ ability to formulate effective queries, e.g., users of =NF data structures are less 
able to formulate effective queries to locate data anomalies than users of INF data structures. This 
leads to the following hypothesis: 


Hypothesis la: INF users will find more data anomalies than +NF users. 


Relative to 1NF data structures, data in 3NF are fragmented into more tables and require users 
to join more tables to retrieve the desired information. Hence, relative to INF, queries in 3NF will 
be more complex. As before, higher complexity decreases a business end-user’s ability to formulate 
effective queries. This leads to the following hypothesis: 


Hypothesis 1b: INF users will find more data anomalies than 3NF users. 


Compared to 3NF, queries of =NF databases require fewer joins as all data items are located in 
fewer tables, perhaps only one. -NF users, however, must contend with repeating fields, non-atomic 
data, and redundant data. These structural problems with -NF databases increase query complexity 
more than the fragmentation associated with 3NF databases. The following hypothesis represents 
business end-users’ decreasing ability to formulate effective queries as query complexity increases: 


Hypothesis Ic: 3NF users will find more data anomalies than NF users. 


3.2 Efficiency 

A common approach to coping with complexity is to use a divide-and-conquer strategy. Users may 
attempt to circumvent the difficulties associated with ~NF by developing multiple queries rather 
than a single query to retrieve the desired data. This strategy will generate more queries to detect 
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the same number of data anomalies. Furthermore, due to query anomalies associated with higher 
complexity, a higher percentage of =NF queries will fail to produce the desired results. Thus, more 
NF queries will be required to obtain the same results as INF queries. This leads to the following 
hypothesis: 


Hypothesis 2a: INF users will find more anomalies per query than =NF users. 


Similarly, relative to INF, 3NF users may circumvent the additional complexity associated with 
data fragmentation by using multiple queries. This strategy will generate more queries to obtain the 
desired information. Alternatively, due to a greater probability of making query mistakes because 
of the higher complexity, a higher percentage of 3NF queries will fail to produce the desired results. 
This failure will require users to make more attempts to construct a query that produces the desired 
results. Thus, more 3NF queries will be required to obtain the same results as 1NF queries. This 
leads to the following hypothesis: 


Hypothesis 2b: INF users will find more anomalies per query than 3NF users. 


The non-atomic data items and repeating fields associated with ~NF increase complexity more 
than the data fragmentation associated with 3NF. Hence, =NF users will experience more difficulty 
formulating queries than 3NF users. Thus, more =NF queries will be required to obtain the same 
data than 3NF queries. This leads to the following hypothesis: 


Hypothesis 2c: 3NF users will find more anomalies per query than NF users. 


3.3 Questions Attempted 

More complex queries require more time to formulate, i.e., users will need to spend more time 
retrieving data to satisfy individual information requirements. Because querying a “NF data 
structure is more complex than querying a 1NF data structure, users of ~NF data structures will 
need to spend more time retrieving data to satisfy the individual information requirements. Thus, 
overall, users of =NF data structures are likely to investigate fewer questions than users of INF data 
structures. This leads to the following hypothesis: 


Hypothesis 3a: INF users will investigate more questions than =NF users. 


Because querying a 3NF data structure is more complex than querying a 1NF data structure, 
users of 3NF data structures will need to spend more time retrieving data to satisfy the individual 
information requirements. Thus, overall, users of 3NF data structures are likely to investigate fewer 
questions than users of 1NF data structures. This leads to the following hypothesis: 


Hypothesis 3b: INF users will investigate more questions than 3NF users. 


Because querying a —NF data structure is more complex than querying a 3NF data structure, 
users of ~NF data structures will need to spend more time retrieving data to satisfy the individual 
information requirements. Thus, overall, the users of ~NF data structures are likely to investigate 
fewer questions than users of 3NF data structures. This leads to the following hypothesis: 


Hypothesis 3c: 3NF users will investigate more questions than “NF users. 
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sation on the Ability of Business End-Use 
4. RESEARCH METHOD: EXPERIMENT 1 

4.1 Research design, participants, and data collection 

This study used a laboratory experiment to test the hypotheses. Participants composed and executed 
queries in SQL for an Oracle database in one of three normalisation levels shown in Appendix A. 

With few exceptions, e.g., Chan et al., (1999), traditional approaches to testing query perfor- 
mance have used pencil and paper or simulated systems that do not provide the interactive feedback 
of typical organisational databases (e.g., Davis, 1990; Suh and Jenkins, 1992; Chan et al., 1993). 
The user interface for this experiment was more realistic because participants received interactive 
feedback, both syntactic error messages and syntactically correct query results, from the system. 
Furthermore, they could correct and re-enter their queries as many times as they wished. 

Eighty advanced undergraduate and master’s level commerce and information technology 
students participated in the experiment. All participants were familiar with general computing 
concepts and activities and, prior to the experiment, had received training in developing SQL 
queries and data integrity controls. The experiment required participants to locate data 
inconsistencies or potential data inconsistencies in a just-in-time (JIT) II inventory system 
(Borthick et al., 1998). Each participant examined one of three data structures: not normalised 
(=NF), first normal form (1NF), or third normal form (3NF). All participants received a set of 
instructions containing the scenario and the details of tasks to be performed (Appendix B). 
Participants had two hours to search for inconsistencies in their assigned database by constructing 
appropriate queries for as many of the 12 questions as they could. Queries for these questions 
appear in Appendix C. Participants created script and log files to record their sessions, which were 
the source for the data for testing the hypotheses. 

To eliminate any experience effect, participants were assigned randomly to one of the three 
treatment groups according to their information systems background. Based on the information 
provided by each participant, a knowledgeable IS person ranked each participant in ascending order, 
i.e., the person with the most extensive IS background was ranked 80 and the next ranked 79, etc. 
The order of the three treatment groups was then randomised, with the order being: =NF, 3NF, and 
INF. Participants were assigned according to their rank, i.e., 80 to NF, 79 to 3NF, 78 to INF, 77 
to INF, 76 to 3NF, 75 to ~NF, etc. This method of randomisation was intended to make the IS 
experience and training of the groups as equivalent as possible. 


4.2 Measures 

The independent variable for all hypotheses was the level of normalisation: ~NF, 1NF, or 3NF. The 
three dependent variables were the number of anomalies found, the number of anomalies found per 
query, and the number of questions attempted. To determine the number of anomalies found, the 
anomalies found by participants were matched with the list of anomalies seeded into the databases. 
The number of anomalies found per query was calculated by dividing the total number of anomalies 
found by a participant by the total number of queries executed by that participant. The number of 
questions attempted was determined by examining the log files. 


5. RESULTS: EXPERIMENT 1 

5.1 Summary Results 

Table 1 shows a summary of the results from experiment 1. For each of the measures, the 1 NF group 
performed best followed by the 3NF and -NF groups, respectively. 
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Model Mean Std Dev Count (N) 


Number of Anomalies Found 


aNF 5.3462 2.9113 26 

INF 8.2917 D.Do22 24 

3NF 7.2308 3.2779 26 
Number of Anomalies Found/Query 

aNF 0.4838 0.2595 24 

INF 1.0056 0.3210 21 

3NF 0.8594 0.4160 23 
Percent of Actual Anomalies Found 

aNF 0.4508 0.2497 24 

INF 0.5360 0.2240 2] 

3NF 0.4804 0.1925 23 
Number of Questions Attempted 

aNF 4.7083 1.5737 24 

INF 6.0476 1.6127 21 

3NF 3.0922 1.2288 23 


Table 1: Summary of results 


5.2 Effects of Normalisation on Query Effectiveness 

Hypothesis | predicted a direct relationship between the level of normalisation and the number of 
data anomalies found. The results of an analysis of variance (ANOVA) provided significant support 
for this hypothesis (R* = 0.1274, F, 73 = 5.33, p = 0.0069, two-tailed test). The least-squares means 
results (Table 2) show the presence, magnitude, and direction of differences between the three 
normal forms. The least-squares means results support the relationship predicted by Hypothesis 1a, 
i.e., INF users found significantly more anomalies than —=NF users. The results also support 
Hypothesis Ic, i.e., 3NF users found significantly more anomalies than “NF users. Statistically 
significant support was not found for Hypothesis 1b, i.e., that INF users would find more anomalies 
than 3NF users. The results, however, were in the predicted direction. 





Normalisation Level Least-Squares Means Significance of Significance of 
Pairwise Comparison Pairwise Comparison 
with =NF with INF 
—aNF 5.3462 
INF 8.2917 0.0020 
3NF 7.2308 0.0396 0.2515 





Table 2: Least squares means analysis for the number of anomalies found 


5.3 Effects of Normalisation on the Efficiency of Error Finding 

Hypothesis 2 predicted a direct relationship between the level of normalisation and the number of 
data anomalies found per query. The results of an ANOVA provided significant support for this 
hypothesis (R? = 0.3089, F, 65 = 14.53, p = 0.0001, two-tailed test). The least-squares means results 
(Table 3) confirm the relationship predicted by Hypothesis 2a, i.e., INF users found significantly 
more anomalies per query than NF users. The results also support Hypothesis 2c, i.e., 3NF users 
found significantly more anomalies per query than =NF users. Statistically significant support was 
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not found for Hypothesis 2b, i.e., that INF users would find more anomalies per query than 3NF 
users. The results, however, were in the predicted direction. 








Normalisation Level Least-Squares Means Significance of Significance of 
Pairwise Comparison Pairwise Comparison 
with ~NF with INF 
aNF 0.4838 
INF 1.0056 0.0001 
3NF 0.8594 0.0003 0.1561 


2 aca cacaacaccccaaacccaccacaaacaaaaaaaa 


Table 3: Least squares means analysis for the number of anomalies found per query 


5.4 Effects of Normalisation on the Number of Questions Attempted 

Hypothesis 3 predicted a direct relationship between the level of normalisation and the number of 
questions attempted. The results of an ANOVA provided significant support for this hypothesis ° 
(R2 = 0.1313, Fy 65 = 4.91, p = 0.0103, two-tailed test). The least-squares means results (Table 4) 
support the relationship predicted by Hypothesis 3a, i.e., INF users attempted significantly more 
questions than =NF users. The results also support Hypothesis 3c, i.e., 3NF users attempted 
significantly more questions than -NF users. Statistically significant support was not found for 
Hypothesis 3b, i.e., that INF users would attempt more questions than 3NF users. The results, 
however, were in the predicted direction. 








Normalisation Level Least-Squares Means Significance of Significance of 
Pairwise Comparison Pairwise Comparison 
with ~NF with INF 
aNF 4.7083 
INF 6.0476 0.0036 
3NF 5.0522 0.0329 0.3805 





Table 4: Least squares means analysis for the number of questions attempted 


5.5 Summary of Experiment 1 

The results from experiment 1 demonstrated that the normalisation level affected users’ ability to 
retrieve data and thus to find data anomalies. The results showed that INF and 3NF data structures 
were more desirable for retrieving data than —NF data structures, i.e., INF and 3NF users found 
more anomalies, found more anomalies per query, and attempted more questions that =NF users. 
Although not statistically significant, results were consistent with hypotheses that INF data 
structures were more desirable for queries than 3NF data structures, 1.e., INF users found more 
anomalies, found more anomalies per query, and attempted more questions than 3NF users. The 
non-significance of the results could, however, be a function of the experiment. Because the time 
available for the participants to perform the tasks in the experiment was limited, most participants 
only attempted to answer a small proportion of the questions. This resulted in participants 
answering few of the questions designed to test the differences between INF and 3NF data 
structures. These questions were further down the list and were only attempted by a few 
participants. To overcome this limitation and enable a better comparison between user performance 
with INF and 3NF data structures, a second experiment was conducted. 
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6. RESEARCH METHOD: EXPERIMENT 2 


6.1 Research design, participants, and data collection 

The purpose of experiment 2 was to test whether a 1NF data structure is more effective and efficient 
than a 3NF data structure for business end-users querying a database. This second experiment was 
similar to experiment 1. 

The data structure (Appendix A), the scenario (Appendix B), the pretraining of participants, and 
the method of assignment to groups were identical to those experiment 1. There were three differences 
between the experiments. First, three levels of normalisation (NF, 1NF, and 3NF) were used in 
experiment | but only two levels of normalisation (1NF and 3NF) were used in experiment 2. Second, 
participants composed and executed SQL queries for 12 questions in experiment 1 whereas 
participants composed and executed SQL queries for 14 questions in experiment 2. These 14 questions 
resulted in queries (Appendix D) that differed substantially between the two levels of normalisation 
being tested. Third, there were 93 participants in experiment 2. These additional participants and the 
removal of a level of normalisation increased the number of participants in each group. While there 
were approximately 23 participants at each of the three normalisation levels in experiment 1, there 
were approximately 46 participants at each of the two normalisation levels in experiment 2. 


6.2 Measures 

The independent variable for all hypotheses was the level of normalisation (NL): 1NF or 3NF. The 
dependent variables were the number of questions attempted and the percent of correct anomalies 
found per query. Percent of anomalies found per query (as opposed to the number of anomalies found 
per query) was considered a more appropriate measure for this experiment because the number of 
anomalies per query ranged between | and 64. To determine the percent of correct anomalies found 
for each question, the anomalies found by the participants were divided by the number of anomalies 
seeded into the databases. The number of questions attempted was determined by examining the log 
files. Grade point average (GPA) was used as the covariate in each of the ANCOVAs. 


7. RESULTS: EXPERIMENT 2 
7.1 Summary Results 
Table 5 summarises participant characteristics and performance by normalisation level. 





Normalisation Level 





INF 3NF 
Grade Point Average (7-point scale, 7 highest) 
Mean 4.7111 4.9821 
Standard deviation 0.8468 0.7648 
Gender 
Number of males 28 29 
Number of females 18 18 
Total number of participants 46 47 
Number of Requests Attempted 
Mean 8.5652 7.4468 
Standard deviation 2.0939 2.1245 
Percent of correct anomalies found/query 
Mean 77.5449 76.6953 
Standard deviation 38.9135 40.2112 
Table 5: Participant Characteristics and Summary Performance 
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7.2 Effects of Normalisation on Query Effectiveness 

Hypothesis 1b and 2b predicted a direct relationship between the levels of normalisation and the 
data anomalies found. Comparing 1NF with 3NF, nested ANCOVA (GPA as covariate) results 
indicate that the percent of correct anomalies found was significantly associated with the 
normalisation level (F;; 725 = 2.00, p = 0.0262, two-tailed test) (Table 6, Panel A). The means results 
(Table 5) confirm the relationship is in the direction predicted by hypothesis 1a, i.e., INF users 
identified a higher percentage of correct anomalies than 3NF users. 


7.3 Effects of Normalisation on the Number of Questions Attempted 

Hypothesis 3b predicted a direct relationship between the level of normalisation and the number of 
questions attempted. The results of a t-test (t = 2.5564, p = 0.0122, two-tailed test) (Table 6, Panel 
B) confirm the relationship is in the direction predicted by hypothesis 3b, i.e., INF users attempted 
significantly more questions than 3NF users. 





Panel A: Effect on Percent of Correct Anomalies Detected 








Source R2 df Mean Square F Value Pr>F 
Model 0.2106 25 24.66 7.74 0.0001 
Error WZ) 92.43 

Query 13 155 12.15 0.0001 
Normalisation Level (Query) 11 0.25 2.00 0.0262 
GPA 1 3.05 23.94 0.0001 





Panel B: Results of t-test with Number of Questions Completed 








Normalisation Level Mean Std Dev t value p value 
INF 8.5652 2.0939 2.5564 0.0122 
3NF 7.4468 2.1245 


Table 6: Main Model and Efficiency Effects as a Function of Normalisation Level (INF, 3NF) 


7.4 Post Hoc Analysis: Effects of Normalisation on Number and Type of Errors 
Table 7 shows the total number of semantic errors participants made in retrieving the data 
anomalies. It also shows whether the errors were row, column, or aggregation errors. 

Results of a nested ANCOVA (GPA as covariate) indicate that the number of errors was 
significantly associated with the normalisation level (F,; 7.5 = 16.43, p = 0.0001, two-tail test) 
(Table 8, Panel A). The means results (Table 7) confirm that INF users made fewer errors 
composing queries to extract data anomalies than 3NF users. 

Results of a nested ANCOVA (GPA as covariate) indicated that 3NF users on average made 
more column errors when composing their queries than INF users (F,; 725 = 7.47, p = 0.0001, two- 
tail test) (Table 8, Panel B). The means results (Table 7) confirm the direction of the relationship. 
The results for row errors were also significant with 3NF users making more row errors when 
composing queries than 1 NF users (Fj; 775 = 27.19, p = 0.0001, two-tail test) (Table 8, Panel C). The 
means results (Table 7) confirm the direction of the relationship. Although 3NF users made more 
errors than INF users, as expected, the result for aggregation errors was not statistically significant 
(Table 8, Panel D). 
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Normalisation Level 




















INF 3NF 

Number of Errors 

Mean per attempted query 0.8655 1.5462 

Standard deviation 1.4123 2.4918 
Number of Column Errors 

Mean per attempted query 0.3020 0.4454 

Standard deviation 0.7701 1.1094 
Number of Row Errors 

Mean per attempted query 0.7232 0.9468 

Standard deviation 2.2713 1.5379 
Number of Aggregation Errors 

Mean per attempted query 0.1497 0.1541 

Standard deviation 0.4285 0.4257 
Number of Extraneous Data Anomalies Found 

Mean per attempted query 8.7056 12.8711 

Standard deviation 24.5896 31.3460 
Table 7: Effect of Normalisation Level on Performance 

Panel A: Effect on Total Number of Errors 
Source R2 df Mean Square F Value Pr>F 
Model 0.4442 2) 54.75 23.18 0.0001 
Error Is 2.36 
Query 13 41523 30.16 0.0001 
Normalisation level (Query) 11 38.81 16.43 0.0001 
GPA 1 37.35 28.52 0.0001 
Panel B: Effect on Column Errors 
einen en tie 
Source R2 df Mean Square F Value Pr>F 
Fagg ems tse ete gee pte ela 
Model 0.3405 25 9.19 14.97 0.0001 
Error 725 0.61 
Query 13 13.50 21.99 0.0001 
Normalisation level (Query) 11 4.59 7.47 0.0001 
GPA 1 10.76 17.53 0.0001 
cer apne i csi eggplant 
Panel C: Effect on Row Errors 
cn cc a ep tar 
Source R2 df Mean Square F Value Pr>F 
epi iin tpn 
Model 0.4785 phe) 21,07 26.61 0.0001 
Error 725 0.79 
Query 13 22.19 28.03 0.0001 
Normalisation level (Query) 1] Zio 27.19 0.0001 
GPA 1 15.64 19.75 0.0001 
a nn eC ee 
(Continued next page) 
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Panel D: Effect on Aggregation Errors 








Source R2 df Mean Square F Value Pr>F 
Model 0.1976 25 1.08 7.14 0.0001 
Error 725 0.15 

Query 13 1.84 12.19 0.0001 
Normalisation level (Query) 11 0.22 1.45 0.1461 


GPA 1 0.94 6.24 0.0127 
Table 8: Effect of Normalisation at INF and 3NF on Column, Row, and Aggregation Errors 


Analysis of individual queries revealed that most of 3NF users’ row errors were missing joins. 
These missing join clauses resulted in 3NF users detecting a larger number of records containing no 
anomalies, i.e., extraneous records. A nested ANCOVA (GPA as covariate) indicated that the 
number of extraneous records was marginally significant as a function of the normalisation level 
(Fi 1.725 = 1.62, p = 0.0898, two-tail test) (Table 9). The means results (Table 7) confirm that INF 
users retrieved fewer extraneous records than 3NF users. 


Source R2 df Mean Square F Value Pr>F 
Model 0.1296 25 3061.34 4.32 0.0001 


Error 725 709.16 
Query 13 4516.66 6.37 0.0001 
Normalisation level (Query) 11 1145.46 1.62 0.0898 


GPA 1 9163.06 12.92 0.0003 


Table 9: Effect of Normalisation at INF and 3NF on Number of Extraneous Data Anomalies 


8. DISCUSSION AND CONCLUSIONS 

Experimental results demonstrated that the level of normalisation affects users’ ability to formulate 
database queries to find data anomalies. The results of experiment 1 showed that 1 NF and 3NF data 
structures are more desirable for retrieving data than =NF data structures, i.e., INF and 3NF users 
found more anomalies, found more anomalies per query, and attempted more questions than ~NF 
users. The results of experiment | did not, however, distinguish between user performance with 
INF and 3NF data structures. A likely cause of this outcome was that few participants attempted the 
questions for which the INF and 3NF queries differed. To distinguish between user querying 
performance with INF and 3NF data structures, participants in experiment 2 worked on questions 
with a higher proportion of queries that differed between the 1 NF and 3NF versions. 

Results of experiment 2 showed that 1 NF data structures were better for querying than 3NF data 
structures, i.e., INF users found a significantly greater percent of the seeded data anomalies and 
attempted significantly more questions than 3NF users. This experiment also revealed that, when 
composing queries to extract data anomalies, INF users made significantly fewer errors than 3NF 
users. The higher error rate for 3NF users resulted in their queries retrieving more extraneous 
records that did not contain errors. Hence, compared with INF users, 3NF users, in addition to 
incurring greater information overload, would have more work to do to separate records containing 
data anomalies from those with no data anomalies. 

Although higher levels of normalisation deter the introduction of data anomalies into a database, 
anomalies still accumulate. Data structures in 1NF provide business end-users the best likelihood of 
detecting such anomalies and only such anomalies through querying the data. Relational database 
theory, however, holds that higher levels of normalisation are desirable because they minimise data 
redundancy, which reduces the likelihood of data anomalies arising. 
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A solution to the problem of choosing between the better updates associated with high levels of 
normalisation and the better queries associated with INF is to have the best of both worlds by 
creating views for querying. The actual database can be normalised to 3NF or beyond, but views 
can be created that allow business end-users to query the data in INF. The impact of this conclusion 
is twofold. First, IS professionals need to assist or train business end-users in creating highly 
normalised data structures for updates and to make business end-users aware of the dangers of 
creating less normalised data structures. Second, IS professionals need to present business end-users 
with data views that minimise their querying difficulties. Based on this research, INF is the best 
data structure for business end-user database querying. 

This study has several limitations. First, the study used students as participants. The participants 
had, however, received training in information technology (IT) and business related subjects, and 
thus are likely to be typical of business end-users that create their own systems. Second, some 
participants may require more training in SQL, particularly those not having a strong background 
in IT. 

Future research could investigate the effects of normalisation on user satisfaction. Such research 
could provide insights into how satisfied users are with databases currently used by organisations. 
Low user satisfaction could be improved through training, use of different data structures, and 
access to views. Future research could also investigate the effect of normalisation on query errors. 
Such research could give insights into the number and type of errors users make when formulating 
queries using different data structures. 


REFERENCES 

BEYNON-DAVIES, P. (1992): Relational database design, Blackwell Scientific Publications, London. 

BORTHICK, A.F., BOWEN, P.L. and SULLIVAN, M.C. (1998): Controlling JIT II: Making the system monitor itself, 
Journal of Cost Management, 12(4): 33-41. 

BROOKSHIRE, R.G. (1993): A relational database primer, Social Science Computer Review, 11(2): 197-213. 

CAMPBELL, D.J. (1988): Task complexity: A review and analysis, Academy of Management Review, 13(1): 40-52. 

CHAN, H.C., WEI, K.K., and SIAU, K.L. (1993): User-database interface: the effect of abstraction level on query 
performance: a field experiment, MISQ, 17(4): 441-64. 

CHAN, H.C., TAN, B.C.Y., and WEI, K.K. (1999): Three important determinants of user performance for database 
retrieval, International Journal of Human Computer Studies, 51: 895-918. 

CODD, E.F. (1970): A relational model of data for large shared data banks, Communications of the ACM, 13(6): 377-87. 

DATE, C.J. (1986): Relational Database: Selected Writings, Addison-Wesley Publishing Company, Reading, 
Massachusetts. 

DATE, C.J. (1998): The birth of the relational model 3, Intelligent Enterprize, 1(4): 45-48. 

DATE, C.J. (2000): An introduction to database systems, 7th Ed. Addison-Wesley Publishing Company, Reading, 
Massachusetts. 

DAVIS, J.S. (1990): Experimental investigation of the utility of data structure and E-R diagrams in database query, 
International Journal of Human Computer Studies, 32:449-59 

ELMASRI, R. and NAVATHE, S.B. (2001): Fundamentals of database systems, 3rd Ed, Addison-Wesley Publishing 
Company, Reading, Massachusetts. 

HUH, Y.U., KELLER, F.R., REDMAN, T.C. and WATKINS, A.R. (1990): Data quality, information and software 
technology, 32(8): 559 - 565. 

ICERMAN, R.C. and HILLISON, W.A. (1991): Disposition of audit-detected anomalies: Some evidence on evaluative 
materiality, Auditing: A Journal of Practice and Theory, Spring: 22-34. 

JANSON, M.A. (1988): Data quality: The Achilles Heel of end-user computing, Omega, 16(5): 491-502. 

KAOMEA, P. and PAGE, W. (1997): A flexible information manufacturing system to the generation of tailored information 
products, Decision Support Systems, 20(4): 345-355. 

LAUDON, K.C. (1986): Data quality and due process in large interorganizational record systems, Communications of the 
ACM, 29(1): 4-11. 

MEDAWAR, K. (1995): Database quality: A literature review of the past and a plan for the future, Program, 29(3): 257-272. 

MOREY, R.C. (1982): Estimating and improving the quality of information in a MIS, Communications of the ACM, 25(5): 
337-342. 


nn ne eeaaEdtdyttdyEEEEESSSEEEESSSSSSSsssSSsSS ns sn 


Journal of Research and Practice in Information Technology, Vol. 33, No. 3, August 2001 251 




















n on the Ability of Business End-Users to Detect 





REDMAN, T.C. (1995): Improve data quality for competitive advantage, Sloan Management Review, 36(2): 99-108. 
SUH, K.S. and JENKINS, A.M. (1992): A comparison of linear keyword and restricted natural language database interfaces 
for novice users, Information Systems Research, 3(3): 252-72. 


BIOGRAPHICAL NOTES 

A.Faye Borthick, DBA, CMA, CPA, CISA, is Professor of Accountancy and Director of the Teaching 
and Learning with Technology Center, Georgia State University, Atlanta, GA, USA. For several 
decades, she has been working at the fault line of technology shifts in business and education. For 
courses in accounting and information systems, she has been developing new approaches to 
improving technology-enabled learning experiences, prompting students to solve problems with 
technology tools, and engaging students in their learning. She has been teaching completely online 
courses since 1997. 

Paul L. Bowen, PhD, is Senior Lecturer in Information Systems, University of Queensland, 
Brisbane, Australia, where he teaches information systems and auditing. His primary research 
includes developing analytical models of data quality, measuring the impact of data errors on 
decision quality, and applying artificial intelligence techniques to the identification and 
classification of data errors. He also has research interests in database design, internal control, and 
software reliability. He has been a systems analyst and project manager at Oak Ridge National 
Laboratory and has taught at The University of Tennessee and Auburn University. 
http ://www.commerce.ugq.edu/staff/bowen.phtml 

Fiona H. Rohde, PhD, is Senior Lecturer in Information Systems, University of Queensland, 
Brisbane, Australia, where she teaches Information Management and Systems Analysis and Design. 
Her research interests include the areas of outsourcing, data and information management, e- 
commerce and their affect on audit and accounting techniques. Fiona worked for KPMG Peat 
Marwick in the Computer Audit Division assisting in team audits in mining and manufacturing 
firms before joining the school approximately 10 years ago. 

Meng R. Liu, MInfSys, is a Lead Business Analyst working for Optus, Australia. He has six years 
of IT and Management Consulting experience (Ernst & Young Consulting & The LiTMUS Group) 
where he worked on consulting engagements in a variety of industries including utilities, 
telecommunications, manufacturing, banking, finance and insurance. Meng’s key areas of expertise 
are in Business Analysis, Data Warehousing, EIS, Business Process Reengineering, Business 
Performance Management, and Business Intelligence. 


a ee ae 
252 Journal of Research and Practice in Information Technology, Vol. 33, No. 3, August 2001 








APPENDIX A 


DATA STRUCTURES 
Non-Normalised ((NF) 


Attribute Name Type Description Comments 


ITEMA Table 
itemno 
vno 
vname 
idesc 
terms 
ctqtyoh 
prodjnol 
issdate 1 
qtyiss1 
qtydef1 
prodvar1 


prodjno2 
issdate2 
qtyiss2 
qtydef2 
prodvar2 


prodjno3 
issdate3 
qtyiss3 
qtydef3 
prodvar3 


recrepnol 
recdate1 
qtyrecl 
unitcostl 
paydate 1 
payamt1 


recrepno2 
recdate2 
qtyrec2 
unitcost2 
paydate2 
payamt2 


Char(6) 
Char(5) 
Char(30) 
Char(30) 
Char(5) 
Number 
Char(5) 
Date 
Number 
Number 
Number 


Char(5) 
Date 

Number 
Number 
Number 


Char(5) 
Date 

Number 
Number 
Number 


Char(6) 
Date 
Number 
Number 
Date 
Number 


Char(6) 
Date 
Number 
Number 
Date 
Number 


Item number of item 

Vendor number 

Name of the vendor 
Description of item 

Terms of payment 

Current quantity on hand 
Production job number! 

Date the item1 was issued 
Quantity of the item issued 
Quantity of the defective item 1 
Production time variance 1. 
E.g. 1.10 indicates production 
required 110% of standard time, 


i.e., a 10% unfavourable variance. 


Production job number 2 

Date the item 2 was issued 
Quantity of the item issued 
Quantity of the defective items 2 
Production time variance 2. 


Production job number 3 

Date the item 3 was issued 
Quantity of the item 3 issued 
Quantity of the defective items 3 
Production time variance 3. 


No of the first report 

Date shipment 1 received 
Quantity of item 1 received 
Cost per unit 1 

Date payment 1 made 

The amount 1 paid 


No of the second report 
Date shipment 2 received 
Quantity of item 2 received 
Cost per unit 2 

Date payment 2 made 

The amount 2 paid 


(Violates atomicity - Violates 1NF) 


(Repeating fields - Violates INF) 


(Repeating fields - Violates 1 NF) 


(payment corresponds to receiving report) 


(payment corresponds to receiving report) 
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First Normal Form (1NF) 





Attribute Name e Description 

ITEMB Table 

itemno Char(6) Item number of item 

vno Char(5) Vendor number 

vname Char(30) Name of the vendor 

idesc Char(30) Description of item 

termdisc Number Terms of payment - discount percent 

termdays Number Terms of payment - number of days discount applies 

cqtyoh Number Current quantity on hand 

ISSUESB Table 

prodjno Char(5) Production job number 

itemno Char(6) Item number of item 

issdate Date Date the item was issued 

qtyiss Number Quantity of the item issued 

qtydef Number Quantity of the defective items 

prodvar Number Effect on production time, i.e., production time variance. For example 1.10 
indicates production required 110% of standard time, i.e., a 10%unfavourable 
variance. 

RECEIPTSB Table 

recrepno Char(6) The number of the receiving report 

itemno Char(6) Item number of item 

recdate Date Date shipment received 

qtyrec Number Quantity received 

unitcost Number Cost per unit 

paydate Date Date payment was made (the payment corresponds to the receiving report) 

payamt Number The amount paid 

254 Journal of Research and Practice in Information Technology, Vol. 33, No. 3, August 2001 





Third Normal Form (3NF) 





Attribute Name Type Description 


ITEMC Table 
itemno 

idesc 

vno 

cqtyoh 


VENDORC Table 
vno 

vname 

termdisc 
termdays 


ISSUESC Table 
prodjno 

itemno 

issdate 

qtyiss 

qtydef 


PRODUCTIONC Table 
prodjno 

itemno 

prodvar 


RECEIPTSC Table 
recrepno 

itemno 

recdate 

qtyrec 

unitcost 


PAYMENTSC Table 
recrepno 

itemno 

paydate 

payamt 


Char(6) 
Char(30) 
Char(5) 
Number 


Char(5) 
Char(30) 
Number 
Number 


Char(5) 
Char(6) 
Date 

Number 
Number 


Char(5) 
Char(6) 
Number 


Char(6) 
Char(6) 
Date 

Number 
Number 


Char(6) 
Char(6) 
Date 

Number 


Item number of item 
Description of item 
Vendor number 

Current quantity on hand 


Vendor number 

Vendor name 

Terms of payment - discount percent 

Terms of payment - number of days discount applies 


Production job number 

Item number of item 

Date the item was issued 
Quantity of the item issued 
Quantity of the defective items 


Production job number 

Item number of item 

Effect on production time, i.e., production time variance. For example 1.10 
indicates production required 110% of standard time, i.e., a 10% 
unfavourable variance. 


The number of the receiving report 
Item number of item 

Date shipment received 

Quantity received 

Cost per unit 


The number of the receiving report 

Item number of item 

Date payment was made (the payment corresponds to the receiving report) 
The amount paid 
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1 the Ability of Business End-Us 


APPENDIX B 


Instructions - NF, INF and 3NF 

Purpose 

Some of the purposes of this exercise are to: 

1. Test how well you can use SQL to search for data anomalies within a database. 

2. Allow you to relate input controls to specific data anomalies by searching for data anomalies 
that input controls should have prevented. 


Task Overview 

Task 1 - Read the Scenario and Chris’s Questions. (5 mins.) 

Task 2 - Record your session (minimal time). 

Task 3 - Locate anomalies and problems Chris (your boss) head of the internal audit department, 
specifically asked you to look for. (40 mins.) 

Task 4 - Report records that contain anomalies or problems Chris specifically asked you to look 
for. (10 mins.) 

Task 5 - Use your ingenuity to locate anomalies and problems not specifically requested by Chris. 
(40 mins.) 

Task 6 - Report records that contain anomalies or problems not specifically requested by Chris. (10 
mins.) 

Task 7 - Transmit log files. (minimal time) 

Task 8 - Complete the survey. (5 mins.) 


Scenario 

Background 

Comfortable Furniture Limited manufactures household and office furniture for distribution 
throughout the world. The company operates from 9:00 am to 5:00 pm, Monday through to 
Saturday. Comfortable Furniture adopted the Just-In-Time (JIT) II method a little over a year ago 
for inventory records. JIT II involves the use of contractual agreements between an organisation 
and its suppliers where the suppliers assume direct responsibility for entire categories of the 
organisation’s inventory. Vendors: (1) provide the required items on a just-in-time basis for 
production schedules; (2) provide these items at favourable, if not preferential prices; and (3) over 
the long term, make innovations in their products, production, and pricing to match the 
organisation’s requirements. In JIT II situations, vendor representatives often occupy offices in the 
organisation’s facilities. The organisation grants the vendor’s representatives access to the 
organisation’s data and freedom to inspect physical inventory. The vendor representatives, rather 
than personnel in the organisation’s purchasing department, place orders for the materials needed 
for the organisation’s production runs. 


JIT II reduces ordering costs, delivery times, handling costs, and inventory holding expenses. For 
these systems to yield the intended benefits, suppliers must not abuse their direct ordering 
capabilities. Converting to JIT II means many of the internal controls for traditional purchasing 
procedures are eliminated. Management expects the internal auditors to query the information 
systems to determine that JIT II vendor relationships meet internal control objectives. 
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Task 


On 2 July 1995, Chris Kaniuk, the manager of the internal audit group (your boss), calls you into 
her office. She wants you to scrutinise the JIT II database using SQL SELECT queries to locate 
any data anomalies before Deer, Price, and Persnickety, the external auditors, begin their 
investigation of the system. Furthermore, Ian McMurdy, president of Comfortable Furniture, asked 
Chris to look for problems or potential problems with the JIT II system. He particularly wants to 
know about potential overpayments to vendors and whether the overpayments arose from 
anomalies or irregularities. Because you must investigate the data anyway, Chris just delegated 
you this task as well. After completing your investigation, prepare a memo (in bullet format) to 
Chris pointing out the anomalies and problems you found in the JIT II database. 


Part 1 - Answer Chris’s questions - EXPERIMENT 1 
1. Items with no item description. 

2. Negative quantities issued for items. 

3. Production variances less than 0.8 or greater than 1.5. 
4 


(a) Term discount rates greater than 5% and | 
(b) Term discount days greater than 45 days. 


Same item description for different item numbers. 

Items that have current quantity on hand less than the average quantity issued. 

Items with unit prices that have differences greater than 50% between different receipts. 
Discount [or possibly even an incorrect discount %] taken after the discount period. 
Production variance greater than 1.1 and quantity defective = 0 

10. Payment date less than receipt date plus half the number of term days 


OND UD 


11. Items that have total receipts greater than 10 times the current quantity on hand. 
12. Vendor with the highest percentage of defective items 


Part 1 - Answer Chris’s questions - EXPERIMENT 2 

1. List the item numbers, issue date, and quantity issued for issues with negative quantity issued. 

2. List vendors (numbers and names) and the items (item numbers and descriptions) they supply 
for vendors from whom we receive a 5% or greater discount. 

3. List vendors (numbers and names) and the total number of items each vendor supplies. 

4. List the item numbers for items with no item descriptions. Check for both missing descriptions 
and descriptions with blank spaces. 

5. Management wants an idea of which vendors are supplying poor quality products. List vendor 
names and the ratio of the sum of the quantity of defective items to the sum of quantity of the 
items issued. Sort the vendors by highest to lowest defective ratio. 

6. Management is very concerned about cash management and wants to know the payments that 
we have made earlier than necessary. List the vendor names, item numbers, receiving report 
numbers, receipt date, payment date, and discount days for transactions where the payment was 
made when less than half the discount days had elapsed. 

7. Management wants an estimate of the current accounts payable. Calculate the total gross 
amount payable for materials we have not paid for yet. 
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8. 


10. 


11. 


2. 


L3, 


14. 








lisation on the Ability of Business End-Users to Detect | 


List items for which the current quantity on hand is more than 50 but there is no usable item 
description, i.e., the item description is either blank or less than 3 characters. Include the 
vendor names associated with each item. 

By vendor, management wants to know the amount of discounts we have lost by not paying 
within the discount period. List vendor names, each vendors’ discount days, and the sum of the 
amount of the discounts lost. Each discount lost can be calculated as the product of the 
discount percentage times the gross dollar amount received. The gross dollar amount received 
is the product of the quantity received times the unit cost. 

JIT II arrangements typically involve obtaining particular items from only one vendor, Le: 
each vendor supplies all of the organisation’s requirements for its designated set of items. 
Management wants to know if the manufacturing staff have managed to arrange to obtain the 
same item from more than one vendor. List pairs of vendor names and pairs of item numbers 
and descriptions for items supplied by different vendors but with the same item descriptions. 
Management wants to look for items that may be overstocked. List the vendor names, item 
numbers and descriptions, current quantity on hand, and the average daily issue since | January 
2000 for items having current quantity on hand greater than 10 times the average daily issue for 
that item. Recall that the average daily issue of an item is the total amount issued for the period 
divided by the number of days in the period. 

Management is concerned that manufacturing is not recording information about production 
variances. List production job numbers, item numbers and descriptions, and issue dates for 
which production variances were missing or not recorded. 

List items for which the value of the current quantity on hand is more than $1,000 (based on 
the most recent unit cost for a receipt of that item) but there is no usable item description, 1.e., 
the item description is either blank or less than 3 characters. Include the vendor names 
associated with each item as well as the current quantity on hand of each item and the estimate 
of the value of the current quantity on hand of those items. 

Management wants to identify vendors with consistently high quality items. List vendor names 
and average defective rates (total quantity defective divided by total quantity issued) for 
vendors which supply items for which none of the items the vendor supplies has a deficiency 
rate greater than 5%. 


a 
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The Code Red Epidemic: a Case Study 


John Lamp 


Faculty of Business and Law, Deakin University, Australia 
Email: John.Lamp@deakin.edu.au 


An analysis of log files from an immune World Wide Web server was used to 
discover the patterns of infection from the Code Red worm variants. Analogies are 
drawn to biological systems. The need for protection is commented on. 

Keywords: EK10 Computer Viruses, ELO8 IS Risk Management 


INTRODUCTION 

Computer viruses have become a fact of life. Once a rarity, the increasing use of networks and 
network applications, such as email, has resulted in their rapid dispersion after creation. Their 
increasing prevalence is providing data that can be analysed to obtain information and devise 
models on rates of infection, spatial dispersion and other aspects. 


THE CODE RED WORM 

The first variant of the Code Red worm (CODERED.A) was discovered on the 19th July 2001. 
Subsequently two variations of the worm were discovered, CODERED.B on 31st July 2001, and 
CODERED.C on 6th August 2001. The worm specifically attacks web servers running Microsoft’s 
Internet Information Server (IIS) on Windows NT 4, 2000 or XP Beta platform. The worm exploits 
an unchecked buffer in the Internet Data Administration area of IIS, contained in idq.dll. It creates 
a buffer overflow and gains complete control over the server. The actions the worm then takes is 
dependent on the variant of the worm. 

Variant A is relatively benign. If the system date is before the 20th of the month, it generates 
random IP addresses and sends copies of itself to other web servers. If the system date is between 
the 20th and 28th, it executes a distributed denial of service (DDoS) attack on 198.137.240.91 
(www1.whitehouse.gov). This server has now been moved to a different IP address. It also checks 
for a file called C:\NOTWORM. If this file exists, the worm goes dormant. It also causes the server 
on which it resides to display the message “Welcome to http://www.worm.com! Hacked By 
Chinese!” when accessed. Typically you see something like the following line in the log file, where 
[240*”N”’] is a string of 240 “N’’s. 

GET /default.ida?[240*”N” ]%u9090%u6858%ucbd3 $u7801%u9090 

Variant B fixes a bug in variant A, which caused variant A to generate the same random 
addresses. It displays the hacked message only if the system language is not English. 

Variant C can be identified in the log files as it uses a string of “X’’s rather than ““N’’s. Variant C 
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seems to work properly only on Windows 2000 machines. It causes NT 4 machines to crash. Variant 
C spawns either 300 or 600 threads to infect other machines compared with the earlier variants’ 99 
threads. Variant C is also more malicious in that it additionally installs a trojan version of Explorer 
and disables system file protection by modifying the registry. It also maps the C: and D: drive to 
web accessible directories /C and /D, thus making any file on the machine web accessible. Yet 
another good reason to have directory browsing turned off as default! It does not have a dormant 
period, nor undertake a DDoS attack. (Trend, 2001a; Trend, 2001b; Trend, 2001c; eEye, 2001; 
Microsoft, 2001) 


LOG FILE ANALYSIS 
The machine that supplied the case study data is running the Sambar Web Server (Sambar, 2001) 


on a Windows 2000 Professional platform. It is not behind any firewall. The log file accumulated 
has the format in Table 1. 


Remote hostname or IP address number if DNS is not enabled/available. 
rfc931 The remote login name of the user. (This is not implemented by the Sambar 
Server). 
authuser The username of the authenticated user. This is available when using password 
protected WWW pages. 


Date and time of the request. 


The HTTP request line as it came from the client. 
status The HTTP response code returned to the client. Indicates whether or not the 
file was successfully retrieved, and if not, what error message was returned. 
bytes The number of bytes transferred. If the status is 200 and bytes are 0, the 
dynamic page size could not be determined. 
msec The time in milliseconds that it took the server to respond to the request. 
“referer” The url the client was on before requesting this url. 


The browser the client is using. 


Table 1: Sambar Server log file format (Sambar, 2001) 












A CODERED.A infection attempt was first logged on the case study machine on 20 July 2001. 
It later re-appeared on 30 July 2001, following which it has become a permanent feature, with 
CODERED.C dominant at around 150 attempts per day. In total 2,539 incidents have been recorded 
up to and including 20th August 2001. The grep utility was used to extract log file lines caused by 
virus infection attempts. These records were imported into an Access database and reformatted so 
that a cross tabulation of date against variant of virus could be run. The results were then graphed 
using Excel. (Figure 1). 

The log file for 20th of August 2001 has no entries for CODERED.A or .B, indicating it has 
entered its dormant period. CODERED.C is still being logged. Its trend is downward, but slower 
than that of variant A/B. 
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Figure 1: Code Red Infection Attempt Pattern — August 


BIOLOGICAL ANALOGUES 

Information technology is an area rich in metaphor, and the terms used to describe biological 
epidemics are often used loosely to describe the actions of computer viruses. (eg Kephart et al, 
1993, 1997) In particular, epidemiological terms, such as infectives that introduce the disease, 
susceptibles that are able to be infected, and immunes that cannot be infected, are often used. From 
the quantitative theory behind biological epidemic analysis, graphs may be constructed, which 
describe the progress of the epidemic. (Poole, 1974) (Figure 2). 


New Cases (-dx/dt) 
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Rate of contact * time (Bt) 
Figure 2: Deterministic epidemic curve for a simple epidemic (after Bailey, 1957) 
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Biological epidemic analysis is based on the number of new cases reported, and the requirement 
to report is, most often, a statutory one placed on the health profession. In this analysis, what was 
actually recorded was unsuccessful infection attempts. There is no parallel to this in classical 
epidemiology — it is not really possible to say, “That breath contained a flu virus particle that did 
not infect me!” Intuitively it might seem that the number of infection attempts should be related to 
the number of infectives in the population, but this requires a number of assumptions on how 
infectives are actually attempting to spread their infection, and needs closer examination. Certainly 
the literature in this area has a large number of mathematical models, which could be useful in 
understanding and predicting the impact of computer viruses on the normal operation of systems. 
Equally, there are a number of areas, such as modelling the spatial spread of biological epidemics, 
which have not been able to be undertaken effectively, but may be able to be addressed in the 
information technology domain. These could then be examined in the biological domain. 


CONCLUSION 

The increased prevalence of viruses and the ability to log attempts by viruses to attach machines is 
resulting in a greater amount of data, which could form the basis of new methods for modelling their 
distribution. Existing models of biological epidemics may be applicable to epidemics of computer 
viruses. In this specific case, large numbers of infection attempts are still being logged despite a 
corrective patch being issued on 18th June 2001, prior to the discovery of the virus in the wild. 
(Microsoft, 2001). 
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LAVINGTON, S. (2000): The Pegasus Story: A history of a vintage British computer. Science Museum, London, 58pp., 
(pound sterling) 6.95 (Paper Back) 


This is an excellent description of how PEGASUS was conceived, designed, built and used. It is well set out and 
compares and contrasts PEGASUS with other similar machines which were produced in the same time frame. There are 
five chapters as follows:- 

1. “What is PEGASUS?” This is a very short three pages, "scene setting" which gives a useful background of the 

technogogy available in the early 1950s and so it explains why it came to be designed as it was. 

2. “The birth of PEGASUS.” This elaborates on the design objectives and emphasises the most important point that it 
was intended to be easy for programmers to use it to achieve high-speed operation with no thought needed to 
arrange the instructions in the memory. This is in contrast to the English Electric DEUCE, which had a faster clock 
speed (see Table 5.1) but the Deuce needed the programmer to arrange the instructions on the storage drum 
optimally a tedious task which can also be data dependent. (Programmers’ time is very valuable!) The PEGASUS 
was also cheaper than the DEUCE — see Table 5.2 — and this combined with ease of programming explains why 
Ferrantis sold more than English Electric sold Deuces. 

3. “Technical description of the Ferranti PEGASUS.” Included here is a list of the machine’s orders/instructions; so 
even today a really keen student could write a program using this information. Furthermore, a PEGASUS simulator, 
which runs on a PC, has been written by members of the Computer Conservation Society and it is available for 
downloading from an internet web site, whose address is given on page iv. 

4. “PEGASUS in action.” This is a chapter which this reviewer considers to be much too short to do justice to all the 
problems which were solved using PEGASUSs’ during the late 1950s and 1960s. 

5. “PEGASUS’s place in history.” Here the Author compares PEGASUS to several other contemporary and later 
computers (Table 5.1) and explains that the machine (serial No. 25) — originally exported to Skandia, in Sweden, has 
gone full circle and is now back in the UK and is fully operational at the Science Museum in London. This 
restoration has been no mean feat considering the fact that most of the electronic components are now obsolete. 

My main criticism of the book is that it contains no references but refers the reader to a list of 70 which are held at a 
.uk Web site. I feel very strongly that any book of this type should be self-contained and should not rely on it’s readers 
having access to the internet. Nevertheless it does perform a useful role by describing, in detail, one of the UK’s most 
successful early commercially available computers. It forms a valuable adjunct to the Science Museum’s operational 
machine. 

Godfrey N. Lance 
ANU and Bermagui, NSW 


MCCANN, D and THORNE, P. Eds. (2000): The Last of the First. CSIRAC: Australia’s First Computer University of 
Melbourne Press, 176 pp., $30.00 (Paper Back) 


There are two distinct parts to this book: The first called “The CSIRAC Team’ comprises eleven contributions from people 
who were involved with the design, construction, programming and maintenance of the computer. Each chapter describes 
the contribution made by an individual to the CSIRAC project. In seven cases the chapters are autobiographical; the 
remainder are written by friends or colleagues of the person in question. Clearly there is some overlap and repetition in 
these chapters but this adds to the interest rather than detracting from it. 

The second part, of the book, comprises twenty more contributions from people who either used CSIRAC for their 
calculations or others who were involved with it. These papers were presented at a Conference, held in June, 1996 which 
was organised by one of the Editors (PT), to celebrate 40 years of computing in Melbourne. The very broad scope of the 
subjects in these chapters indicates what a valuable contribution CSIRAC made to scientific research in Australia in the 
1950s and very early 1960s. 

It is worth mentioning a few of the subjects to illustrate this point 

i) Building research - structural problems and their application to the design of multistorey structures. 

ii) Planning the Victorian Hydro-thermal Generating System. 

iii) Atomic structure of metal surfaces. 

iv) The production of solar position and radiation tables 

In addition some computing for commerce and industry at financial institutions was also performed and this is covered 
in one of the chapters by Peter Murton. 

CSIRAC was designed and built while Pearcy and Beard were working in the CSIRO Division of Radio physics and 
clearly this project was not one of the main stream activities of that Division. Hence the CSIRAC team felt some lack of 
support from management; however the fact that they persisted, and won in the end, is a great tribute to their hard work 
motivation and persistence. They were also disappointed that CSIRAC did not become the main stream computer for 
CSIRO. Rather commercially available computers were purchased in 1963 from Control Data Corporation and these were 
housed in a newly created Division of Computing Research within CSIRO. Experience, in many parts of the world, had 
shown that it was never satisfactory to use a research machine for routine calculations for too long. The designers, naturally, 
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were keen to ‘improve’ the machine and the programmers were frustrated whenever changes were made. All serious users 
wanted a stable platform for which they could write programs which would work for a long time without modification. 
Hence in this reviewer’s opinion and with some hind sight the right decision was taken. 

I read the whole of this book with considerable enjoyment and it is important for today’s computer professionals to do 
the same. Note that CSIRAC is the only ‘fully intact’ computer left in the world today, which also has many fully 
operational programs available, so it has a unique place in the history of computing and everyone hopes this will long 
continue but obviously maintenance will become an ever increasing problem. Interestingly an emulator has been written for 
modern PCs and this can be used to simulate original CSIRAC programs. 

I have one criticism of the book and this is that references and bibliographies are scattered throughout in a rather 
haphazard manner (they appear on at least six different pages) thus it is frustrating to find the reference which the chapter 
being read is talking about. I feel that the Editors could with little effort have improved the presentation and uniformity of 
the contributions by putting all references into ‘The Bibliography’ at the end of the book just before the comprehensive 
index. In this way the reader would always be able to read any reference without a deal of searching. 

Godfrey N. Lance 
ANU and Bermagui, NSW 
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